Abstract

This  thesis  explores  strategies  for  exploiting  locality  in  three  major  areas  of  parallel  computation: 
packet  routing,  graph  algorithms,  and  network  emulations.  Each  of  these  areas  is  covered  by  a 
separate  chapter. 

Chapter  1  describes  a  network-independent  approach  to  the  packet-routing  problem.  Our 
strategy  is  to  partition  the  problem  into  two  stages:  a  path-selection  stage  and  a  scheduling 
stage. In the first stage we find paths for the packets with small congestion, c, and dilation, d.
Once the paths are fixed, both are lower bounds on the time required to deliver the packets. In
the  second  stage  we  find  a  schedule  for  the  movement  of  each  packet  along  its  path  so  that  no 
two  packets  traverse  the  same  edge  at  the  same  time,  and  so  that  the  total  time  and  maximum 
queue size required to route all of the packets to their destinations are minimized.

Although  path-selection  strategies  vary  from  network  to  network,  we  show  that  there  is  an 
efficient  on-line  scheduling  algorithm  for  the  entire  class  of  layered  networks.  When  applied  to 
an N-packet problem, the algorithm produces a schedule of length O(c + d + log N), with high
probability. 

The algorithm has many applications to routing and sorting. Among them are the first on-line
algorithms for routing N packets on an N-node shuffle-exchange graph in O(log N) steps
using constant-size queues and for routing kN packets on an N-node k-dimensional array with
maximum side length M in O(kM) steps using constant-size queues. The scheduling algorithm
can  also  be  used  as  a  subroutine  in  sorting  algorithms.  It  yields  the  first  asymptotically  optimal 
algorithms  for  sorting  on  butterfly,  shuffle-exchange,  and  multidimensional  array  networks  using 
constant-size  queues. 

The algorithm can also be applied to the construction of area-universal networks: N-node
networks with VLSI-layout area O(N) that can simulate all other networks with area O(N)
with only O(log N) slowdown.

In Chapter 1 we also prove the existence of a schedule of length O(c + d) for any set of
packets  whose  paths  have  congestion  c  and  dilation  d  (in  any  network)  that  uses  constant-size 
queues.  Unfortunately,  no  efficient  algorithm  for  constructing  the  schedule  is  known. 

Chapter  2  introduces  a  model  for  parallel  computation,  called  the  distributed  random-access 
machine  (DRAM),  in  which  the  communication  requirements  of  parallel  algorithms  can  be 
evaluated.  A  DRAM  is  an  abstraction  of  a  parallel  computer  in  which  memory  accesses  are 
implemented  by  routing  messages  through  a  communication  network.  It  explicitly  models  the 
congestion  of  messages  across  cuts  of  the  network. 


We introduce the notion of a conservative algorithm as one whose communication requirements
at each step can be bounded by the congestion of pointers of the input data structure
across cuts of a DRAM. A conservative algorithm is guaranteed not to generate undue congestion
in any underlying network. Chapter 2 presents conservative algorithms for a variety of
graph problems. Problems such as computing treewalk numberings, finding the separator of a
tree, and evaluating all subexpressions in an expression tree can be solved in O(log N) steps
for N-node trees by conservative algorithms for an exclusive-read exclusive-write DRAM. More
complex problems such as finding a minimum-cost spanning forest, computing biconnected
components, and constructing an Eulerian cycle require O(log^2 N) steps for graphs of size N.
For concurrent-read concurrent-write DRAM's, all of these problems can be solved by O(log N)-step
conservative algorithms.

Chapter  3  examines  the  problem  of  how  efficiently  a  host  network  can  emulate  a  guest 
network. The goal is to emulate T_G steps of an N_G-node guest network on an N_H-node
host network. We call an emulation work-preserving if the time required by the host, T_H, is
O(T_G N_G / N_H), because then both the guest and host networks perform the same amount of
total work (processor-time product), Θ(T_G N_G), to within a constant factor. A work-preserving
emulation is efficient because it achieves optimal speedup over a sequential emulation of the
guest. We say that an emulation is real-time if T_H = O(T_G), because then the host emulates
the  guest  with  constant  delay. 

Although  many  isolated  emulation  results  have  been  proved  for  specific  networks  in  the 
past,  and  measures  such  as  dilation  and  congestion  were  known  to  be  important,  the  field  has 
lacked  a  model  within  which  general  results  and  meaningful  lower  bounds  could  be  proved.  We 
attempt  to  provide  such  a  model,  along  with  techniques  for  proving  lower  bounds  based  on 
comparing the locality of the networks.

Some  of  the  more  interesting  and  diverse  results  in  Chapter  3  include  a  proof  that  a  linear 
array  can  emulate  a  (much  larger)  butterfly  in  a  work-preserving  fashion,  but  that  a  butterfly 
cannot emulate an expander (of any size) in a work-preserving fashion; a proof that a mesh can
be emulated in real time in a work-preserving fashion on a butterfly, even though any O(1)-to-1
embedding of the mesh has dilation Ω(log N); and a proof that an N-node butterfly can emulate
an N log N-node shuffle-exchange graph in a work-preserving fashion, and vice versa.

Chapter 4 presents an algorithm for finding a minimum-cost spanning tree of an N-node
graph on an N × N mesh-connected computer. The algorithm has the same O(N) running time
as  the  previous  algorithms,  but  it  is  much  simpler. 

Keywords:  parallel  computation,  fixed-connection  networks,  packet  routing  algorithms,  area- 
universal  networks,  fat-trees,  distributed  random-access  machines,  graph  algorithms,  network 
emulations. 
Introduction


This  thesis  explores  strategies  for  exploiting  locality  in  parallel  computation.  Locality  is  perhaps 
best  illustrated  by  the  telephone  system.  A  local  phone  call  exhibits  locality  because  it  is 
transmitted  over  a  small  physical  distance  and  through  few  switching  stations.  On  the  other 
hand,  a  long  distance  call  may  pass  through  many  switching  stations  and  span  the  globe.  The 
telephone  company  exploits  the  aggregate  locality  of  a  typical  set  of  phone  calls  by  allocating 
more  resources  to  local  calls  than  to  long  distance  calls.  The  communications  hardware  is 
arranged  in  a  hierarchy,  with  bushy  local  networks  at  the  bottom  and  a  sparser  satellite  system 
at  the  top.  The  phone  system  itself  may  be  said  to  exhibit  locality  in  the  sense  that  it  reflects 
the  locality  of  a  typical  set  of  calls. 

The routing network in a parallel computer has a job much like that of the phone system.
It must deliver packets of information between different processors. In this thesis, however, we
shall restrict our attention to networks that are more tightly coupled than the phone system.
These  networks  route  packets  to  their  destinations  via  a  series  of  globally  synchronized  time 
steps.  We  model  a  routing  network  as  a  graph,  where  the  nodes  correspond  to  processors  or 
switches,  and  the  edges  correspond  to  wires.  At  each  step  a  packet  can  either  traverse  an  edge 
or  wait  in  a  queue,  and  each  edge  can  transmit  at  most  one  packet.  The  time  to  deliver  a  set  of 
packets is equal to the number of steps required for every packet to reach its destination.


Packet  routing 

Two  important  measures  of  the  locality  of  a  set  of  packets  are  its  congestion  and  dilation.  The 
congestion, c, of a set of packets is the maximum number of packets that use any edge of the
network. The dilation, d, is the length of the longest path taken by any packet. Both of these
measures are lower bounds on the time required to deliver the messages. The congestion is
a  lower  bound  because  c  packets  must  pass  through  some  edge,  and  at  most  one  packet  can 
traverse  the  edge  at  each  time  step.  The  dilation  is  a  lower  bound  because  some  packet  must 
travel  a  distance  of  d  and  it  can  travel  a  distance  of  at  most  one  in  each  time  step. 
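
As a small illustration of these two definitions, the following sketch computes c and d for a fixed set of paths and reports the resulting lower bound max(c, d). The function name, the representation of a path as a list of nodes, and the treatment of edges as directed are assumptions made for this example only.

```python
from collections import Counter

def congestion_and_dilation(paths):
    """Return (c, d) for a list of paths, each given as a list of nodes.

    c is the largest number of paths that use any one edge (edges are
    treated as directed here, an assumption of this sketch);
    d is the number of edges on the longest path.
    """
    edge_use = Counter()
    for path in paths:
        for edge in zip(path, path[1:]):
            edge_use[edge] += 1
    c = max(edge_use.values(), default=0)
    d = max((len(path) - 1 for path in paths), default=0)
    return c, d

# Hypothetical example: three packets whose paths all use edge (2, 3).
paths = [
    [1, 2, 3, 4],
    [5, 2, 3],
    [6, 2, 3, 7],
]
c, d = congestion_and_dilation(paths)
lower_bound = max(c, d)   # no schedule can finish in fewer steps
```

Here the edge (2, 3) carries all three packets and the longest path has three edges, so both measures equal 3.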

Chapter  1  describes  a  network-independent  approach  to  the  packet  routing  problem.  Our 
strategy is to partition the problem into two stages: a path-selection stage and a scheduling
stage. The path-selection stage varies from network to network. Its goal is to find a set of
paths for the packets that exhibits locality, i.e., has small congestion and dilation. The goal of
the  second  stage  is  to  determine  when  each  packet  should  move,  and  when  it  should  wait  in  a 
queue.  The  second  stage  must  ensure  that  at  most  one  packet  traverses  each  edge  at  each  time 
step.  It  should  exploit  the  locality  present  in  the  paths  produced  by  the  first  stage,  i.e.,  the 
time  to  deliver  the  packets  should  be  as  close  to  the  lower  bounds  c  and  d  as  possible  and  the 
queue  size  should  be  minimized. 

The  focus  of  Chapter  1  is  on  the  second  stage.  Two  main  scheduling  results  are  proved  there. 
First we show that there is a schedule of length O(c + d) for any set of packets with congestion
c and dilation d (in any network) that uses constant-size queues. Unfortunately, no efficient
algorithm for constructing it is known. However, for the special case of layered networks, we
show that there is an efficient randomized algorithm for routing N packets in O(c + d + log N)
steps  using  constant-size  queues. 

The  algorithm  for  routing  packets  on  layered  networks  has  many  applications  to  routing 
and sorting. Among them are the first on-line algorithms for routing N packets on an N-node
shuffle-exchange graph in O(log N) steps using constant-size queues and for routing kN packets
on an N-node k-dimensional array with maximum side length M in O(kM) steps using constant-size
queues. The routing algorithm can also be used as a subroutine in sorting algorithms. It
yields  the  first  asymptotically  optimal  algorithms  for  sorting  on  butterfly,  shuffle-exchange,  and 
multidimensional  array  networks  using  constant-size  queues. 

A second major application area is in the construction of area-universal networks: N-node
networks with VLSI-layout area O(N) that can simulate all other networks with area O(N)
with only O(log N) slowdown. (The generalization to three dimensions is straightforward.)
These networks are area-universal precisely because they display the kinds of locality present
in the phone system. The communications hardware is arranged in a hierarchy, with most of it
devoted  to  making  local  connections. 

Distributed  random-access  machines 

Another  important  measure  of  locality  is  the  load  factor  of  a  set  of  packets.  Before  defining 
the load factor, we need a few other notions. A cut S of a network is a subset of the nodes
of the network. The capacity cap(S) is the number of wires connecting processors in S to the
rest of the network, S̄. The load of a set M of packets on a cut S, load(M, S), is the number
of packets in M that must cross the cut S. The load factor of M on S, λ(M, S), is the ratio
of the load to the capacity, λ(M, S) = load(M, S)/cap(S). The load factor of M on the entire
network is the maximum load factor over all cuts, λ(M) = max_S load(M, S)/cap(S). The load factor is
a  lower  bound  on  the  congestion  of  any  set  of  paths  for  the  packets,  and  thus  is  a  lower  bound 
on  the  time  to  deliver  the  packets. 
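
The definition can be checked directly on very small networks by enumerating every cut. The sketch below does so by brute force; it takes time exponential in the number of nodes and is meant only to make the definition concrete. The function name, the argument layout, and the decision to skip cuts of capacity zero are assumptions of this example.

```python
from itertools import combinations

def load_factor(nodes, wires, packets):
    """Brute-force the load factor of a packet set over all cuts.

    nodes   -- list of node names
    wires   -- list of undirected (u, v) pairs
    packets -- list of (origin, destination) pairs
    A cut S ranges over every nonempty proper subset of the nodes.
    Cuts with capacity 0 are skipped (an assumption of this sketch).
    """
    best = 0.0
    for size in range(1, len(nodes)):
        for subset in combinations(nodes, size):
            s = set(subset)
            # A wire or packet crosses the cut if exactly one end lies in S.
            cap = sum((u in s) != (v in s) for u, v in wires)
            load = sum((a in s) != (b in s) for a, b in packets)
            if cap > 0:
                best = max(best, load / cap)
    return best

# Hypothetical 4-node linear array 0 - 1 - 2 - 3.
nodes = [0, 1, 2, 3]
wires = [(0, 1), (1, 2), (2, 3)]
packets = [(0, 3), (1, 2), (0, 2), (3, 1)]
lam = load_factor(nodes, wires, packets)
```

In this example all four packets must cross the single wire between nodes 1 and 2, so the cut S = {0, 1} gives the maximum ratio of 4.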

Chapter  2  introduces  a  model  called  the  Distributed  Random-Access  Machine  (DRAM) 
in which the time required to deliver a set of packets is equal to its load factor. A DRAM is an
abstraction of a parallel computer in which memory accesses are implemented by routing packets
through a communication network. The model was originally intended to be an abstraction of a
class of area-universal networks called fat-trees [29, 56]. Fat-trees are well modeled by DRAM's
because, as we shall see in Chapter 1, the time to deliver a set M of packets on an N-node
fat-tree is O(λ(M) + log N), with high probability.

The notion of load factor can be extended to measure the locality of a data structure
embedded  in  a  parallel  computer.  A  natural  way  to  embed  a  data  structure  in  a  DRAM  is  to 
put  one  record  of  the  data  structure  into  each  processor.  The  record  can  contain  data,  including 
pointers  to  records  in  other  processors.  We  measure  the  locality  of  an  embedding  by  treating 
the  data  structure  as  a  set  of  pointers  and  generalizing  the  concept  of  load  factor  to  sets  of 
pointers. The load of a set P of pointers across a cut S, denoted load(P, S), is the number of
pointers in P from a processor in S to a processor in S̄, or vice versa. The load factor of P on
the entire DRAM is λ(P) = max_S load(P, S)/cap(S). The load factor of a data structure is the
load  factor  of  the  set  of  its  pointers. 

A conservative algorithm is a DRAM algorithm in which the load factor of the set of memory
accesses produced at each step does not exceed the load factor of the input data structure.
A conservative algorithm exploits the locality in the input data structure because it never produces
more congestion across cuts of the DRAM than is implicit in the input data structure.
Consequently, a conservative algorithm is guaranteed not to produce undue congestion in any
underlying network. With the help of a lemma for "shortcutting" pointers in a data structure
without increasing its load, we design fast conservative algorithms for a variety of graph
problems. Problems such as computing treewalk numberings, finding the separator of a tree,
and evaluating all subexpressions in an expression tree can be solved in O(log N) steps for
N-node trees by conservative algorithms for an exclusive-read exclusive-write DRAM. More
complex problems such as finding a minimum-cost spanning forest, computing biconnected
components, and constructing an Eulerian cycle require O(log^2 N) steps for graphs of size N.
For concurrent-read concurrent-write DRAM's, all of these problems can be solved by O(log N)-step
conservative algorithms.

Emulations 

Of  particular  interest  is  the  special  case  where  the  embedded  data  structure  is  a  network.  An 
embedding  is  a  map  from  a  guest  network  to  a  host  network  that  takes  nodes  of  the  guest  to 
nodes  of  the  host,  and  edges  of  the  guest  to  paths  in  the  host.  Three  important  measures  of  an 
embedding  are  its  congestion,  dilation,  and  load.  The  congestion  and  dilation  of  the  paths  are 
analogous  to  the  congestion  and  dilation  defined  for  the  paths  taken  by  a  set  of  packets.  The 
load  of  an  embedding  is  the  maximum  number  of  guest  nodes  mapped  to  any  one  of  the  host 
nodes.  The  assignment  of  two  meanings  to  the  word  load  is  unfortunate,  but  well  established. 
In this thesis, the intended meaning should always be clear from the context. Furthermore, the
load of a set M of packets on a cut S is denoted by load(M, S), while the load of an embedding
is denoted by l.
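
To make the three measures of an embedding concrete, the sketch below computes them for a hypothetical embedding of a 4-node ring (the guest) in a 4-node linear array (the host). The dictionary layout and all names are assumptions of this example, and host edges are treated as undirected.

```python
from collections import Counter

def embedding_measures(node_map, edge_paths):
    """Return (load, congestion, dilation) of an embedding.

    node_map   -- dict: guest node -> host node
    edge_paths -- dict: guest edge (u, v) -> host path, given as the list
                  of host nodes from node_map[u] to node_map[v]
    """
    # Load: the most guest nodes mapped to any single host node.
    load = max(Counter(node_map.values()).values())
    # Congestion: the most guest edges routed over any single host edge.
    edge_use = Counter()
    for path in edge_paths.values():
        for a, b in zip(path, path[1:]):
            edge_use[frozenset((a, b))] += 1
    congestion = max(edge_use.values(), default=0)
    # Dilation: the longest host path assigned to any guest edge.
    dilation = max((len(p) - 1 for p in edge_paths.values()), default=0)
    return load, congestion, dilation

# Guest: ring a-b-c-d-a. Host: linear array 0-1-2-3.
node_map = {"a": 0, "b": 1, "c": 2, "d": 3}
edge_paths = {
    ("a", "b"): [0, 1],
    ("b", "c"): [1, 2],
    ("c", "d"): [2, 3],
    ("d", "a"): [3, 2, 1, 0],   # the wrap-around edge spans the whole array
}
measures = embedding_measures(node_map, edge_paths)
```

The wrap-around edge of the ring is what drives the result: it is stretched over three host edges (dilation 3) and shares each of them with one other guest edge (congestion 2), while each host node holds one guest node (load 1).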

A  guest  network  is  typically  embedded  in  a  host  network  so  that  the  host  can  emulate 
some  computation  to  be  performed  by  the  guest.  An  important  consequence  of  the  scheduling 
results of Chapter 1 is that if a guest network can be embedded in a host network with load l,
congestion c, and dilation d, then the host can emulate the guest with slowdown O(l + c + d).
Most of the efficient emulation schemes that we know of arise directly from an embedding of
a guest network in a host with small congestion, dilation, and load. As we shall see, however,
a  good  embedding  of  the  guest  in  the  host  is  not  required  for  the  host  to  perform  an  efficient 
emulation  of  the  guest. 

Chapter  3  examines  the  problem  of  how  efficiently  a  host  network  can  emulate  a  guest 
network. The goal is to emulate T_G steps of an N_G-node guest network on an N_H-node
host network. We call an emulation work-preserving if the time required by the host, T_H, is
O(T_G N_G / N_H), because then both the guest and host networks perform the same amount of
total work (processor-time product), Θ(T_G N_G), to within a constant factor. A work-preserving
emulation is efficient because it achieves optimal speedup over a sequential emulation of the
guest. We say that an emulation is real-time if T_H = O(T_G), because then the host emulates
the  guest  with  constant  delay. 
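
The claim about total work follows by multiplying the host's time bound by its number of processors. The short derivation below spells this out in the notation of the preceding paragraph; it assumes the amsmath package, and the numbers in the final line are a hypothetical illustration rather than a result from the thesis.

```latex
\begin{align*}
\text{guest work} &= T_G \, N_G, \\
\text{host work}  &= T_H \, N_H
   = O\!\left(\frac{T_G N_G}{N_H}\right) N_H
   = O(T_G N_G), \\
\text{e.g. } N_G = 1024,\ N_H = 32,\ T_G = 100
   &\;\Longrightarrow\; T_H = O\!\left(\frac{100 \cdot 1024}{32}\right) = O(3200).
\end{align*}
```

Under the natural assumption that a single processor needs about T_G N_G steps to emulate the guest sequentially, a host that finishes in O(T_G N_G / N_H) steps achieves a speedup proportional to N_H, which is the sense in which the speedup is optimal.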

Although  many  isolated  emulation  results  have  been  proved  for  specific  networks  in  the 
past,  and  measures  such  as  dilation  and  congestion  were  known  to  be  important,  the  field  has 
lacked  a  model  within  which  general  results  and  meaningful  lower  bounds  could  be  proved.  We 
attempt  to  provide  such  a  model,  along  with  techniques  for  proving  lower  bounds  based  on 
comparing the locality of the networks. As a general rule, networks that exhibit locality are easier
to  emulate  than  those  that  do  not. 

The simplest measure of the locality of a network is its diameter. Let δ(u, v) denote the
distance between a pair of nodes u and v, i.e., the length of the shortest path between u and
v. The diameter, D, of a network is the maximum over all pairs (u, v) of the distance between
u and v, D = max_(u,v) δ(u, v). In general, a network with large diameter exhibits more locality
than  a  network  with  small  diameter.  For  example,  a  linear  array  exhibits  more  locality  than  a 
shuffle-exchange  graph. 
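
The comparison between the two networks can be checked on small instances. The sketch below computes the diameter by breadth-first search from every node and builds a 16-node linear array and a 16-node shuffle-exchange graph. The helper names and the adjacency-dictionary representation are assumptions of this example, and the shuffle-exchange edges are treated as undirected.

```python
from collections import deque

def distances_from(adj, source):
    """Breadth-first search: shortest-path distance to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Maximum distance over all pairs; assumes the network is connected."""
    return max(max(distances_from(adj, u).values()) for u in adj)

def linear_array(n):
    """n nodes in a line, node i joined to i - 1 and i + 1."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def shuffle_exchange(k):
    """2^k-node shuffle-exchange graph on k-bit labels (undirected)."""
    n = 1 << k
    adj = {u: set() for u in range(n)}
    for u in range(n):
        exchange = u ^ 1                                  # flip the low bit
        shuffle = ((u << 1) | (u >> (k - 1))) & (n - 1)   # rotate bits left
        for v in (exchange, shuffle):
            if v != u:                                    # drop self-loops
                adj[u].add(v)
                adj[v].add(u)
    return adj

d_array = diameter(linear_array(16))
d_shuffle = diameter(shuffle_exchange(4))
```

For 16 nodes the linear array has diameter 15, while the shuffle-exchange graph has diameter at most 2k − 1 = 7, which matches the intuition that the array is the more local of the two.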

The expansion rate is another important measure of the locality of a network. Let B_r(u)
denote the ball of radius r around a node u, i.e., the set of nodes within distance r of u,
B_r(u) = {v | δ(u, v) ≤ r}. For a set S of nodes, the neighborhood of S, N(S), is the set of nodes
within a distance of 1 of some node in S, excluding those nodes in S, N(S) = (∪_{u∈S} B_1(u)) − S.
We say that an n-node network has expansion rate e if for every set S of size at most n/2, the
size of the neighborhood of S is at least e|S|. We call a network for which the expansion rate e
is  at  least  some  fixed  positive  constant  an  expander.  An  expander  exhibits  little  locality. 
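
For very small networks the expansion rate can be found by checking every set S of at most n/2 nodes. The sketch below does this by brute force, which is exponential in n and intended only to illustrate the definition. The function names and the two example networks are assumptions of this example.

```python
from itertools import combinations

def neighborhood(adj, s):
    """N(S): nodes adjacent to some node of S, excluding S itself."""
    return {v for u in s for v in adj[u]} - set(s)

def expansion_rate(adj):
    """Smallest |N(S)| / |S| over all nonempty S with |S| <= n/2."""
    nodes = list(adj)
    best = float("inf")
    for size in range(1, len(nodes) // 2 + 1):
        for s in combinations(nodes, size):
            best = min(best, len(neighborhood(adj, s)) / size)
    return best

# Two hypothetical 8-node networks: a linear array and a complete graph.
linear = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
complete = {i: [j for j in range(8) if j != i] for i in range(8)}

rate_linear = expansion_rate(linear)
rate_complete = expansion_rate(complete)
```

In the linear array the set {0, 1, 2, 3} has a single neighbor, so its expansion rate is 1/4 and shrinks as the array grows. In the complete graph every set of four nodes has four neighbors, giving rate 1.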

Some of the more interesting and diverse results in Chapter 3 include a proof that a linear
array can emulate a (much larger) butterfly in a work-preserving fashion, but that a butterfly
cannot emulate an expander (of any size) in a work-preserving fashion; a proof that a mesh can
be emulated in real time in a work-preserving fashion on a butterfly, even though any O(1)-to-1
embedding of the mesh in a butterfly has dilation Ω(log N); and a proof that an N-node
butterfly can emulate an N log N-node shuffle-exchange graph in a work-preserving fashion,
and vice versa.

Mesh-based  algorithms 

Chapter 4 presents an algorithm for finding a minimum-cost spanning tree of an N-node graph
on an N × N mesh-connected computer. The algorithm has the same O(N) running time as
the previous algorithms, but it is much simpler. In VLSI models, the mesh is the ultimate local
network because each processor in the mesh is connected to a small number of neighbors by
minimum-length wires.


Chapter  1 

Packet  routing  algorithms 


1.1  Introduction 

Figure 1-1 illustrates the standard graph model for packet routing. The shaded nodes labeled
1 through 6 represent processors or switches. The edges between the nodes represent wires. At
the  end  of  each  edge  is  an  edge  queue  that  can  hold  a  small  number  of  packets  (in  this  example, 
two).  A  packet  is  depicted  by  a  square  box  containing  the  label  of  its  destination.  Before  the 
routing  begins,  packets  are  stored  at  their  origins  in  special  initial  queues.  For  example,  packets 
4  and  5  are  stored  in  the  initial  queue  at  node  1. 

The  goal  is  to  route  each  packet  from  its  origin  to  its  destination  via  a  series  of  synchronized 
time  steps.  At  each  step  at  most  one  packet  can  traverse  each  edge.  Furthermore,  a  packet  can 
traverse  an  edge  only  if  at  the  beginning  of  the  step  its  edge  queue  is  not  full.  Upon  traversing 
the last edge on its path, a packet is removed from the edge queue and placed in a special final queue
at its destination. For simplicity, the final queues are not shown in Figure 1-1. Independent
of the routing algorithm used, the sizes of the initial and final queues are determined by the
particular packet routing problem to be solved. Thus, any bound on the maximum queue size
required  by  a  routing  algorithm  refers  to  the  edge  queues  only. 
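
The rules of this model can be exercised with a small simulator. The sketch below is not the scheduling algorithm developed in this chapter; it is a plain greedy scheduler on fixed paths, written only to illustrate the synchronized steps and the edge-queue constraint. Its assumptions are that edges are directed, that the packet with the lowest index wins a contested edge, and that a packet traversing its last edge goes straight to the final queue without needing edge-queue space. All names are hypothetical.

```python
from collections import Counter

def simulate(paths, queue_size, max_steps=1000):
    """Greedy synchronous routing on fixed paths; returns the steps used.

    paths[i] is the list of nodes visited by packet i. At each step at
    most one packet crosses each edge, and a packet may enter an edge
    queue only if that queue was not full when the step began.
    """
    pos = [0] * len(paths)            # edges already traversed per packet

    def done(i):
        return pos[i] == len(paths[i]) - 1

    steps = 0
    while not all(done(i) for i in range(len(paths))):
        if steps == max_steps:
            raise RuntimeError("not finished within max_steps")
        # Edge-queue occupancy at the beginning of the step. Packets in
        # initial queues (pos 0) or final queues (done) are not counted.
        occupancy = Counter(
            (paths[i][pos[i] - 1], paths[i][pos[i]])
            for i in range(len(paths))
            if pos[i] > 0 and not done(i)
        )
        used = set()
        movers = []
        for i in range(len(paths)):   # lowest index has priority
            if done(i):
                continue
            edge = (paths[i][pos[i]], paths[i][pos[i] + 1])
            is_last = pos[i] + 1 == len(paths[i]) - 1
            if edge in used:
                continue              # edge already taken this step
            if not is_last and occupancy[edge] >= queue_size:
                continue              # target edge queue was full
            used.add(edge)
            movers.append(i)
        for i in movers:              # all moves happen simultaneously
            pos[i] += 1
        steps += 1
    return steps

# Hypothetical example: three packets that all need edge (2, 3).
paths = [[1, 2, 3, 4], [5, 2, 3], [6, 2, 3, 7]]
steps_used = simulate(paths, queue_size=2)
```

On this example the three packets take turns on the shared edge (2, 3), and the last one still has one more edge to cross afterwards, so the greedy schedule needs 5 steps.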

The  task  of  designing  an  efficient  packet  routing  algorithm  is  central  to  the  design  of  most 
large-scale  general-purpose  parallel  computers.  In  fact,  even  the  basic  unit  of  time  in  some 
parallel machines is measured in terms of how fast the packet router operates. For example,
the speed of an algorithm in the Connection Machine is often measured in terms of routing
cycles (roughly the time to route a random permutation) or petit cycles (the time to perform
an atomic step of the routing algorithm). Similarly, the performance of machines like the BBN
Butterfly is substantially influenced by the speed and rate of successful delivery of its router.

This chapter describes joint research with Tom Leighton and Satish Rao [53].

Figure 1-1: A graph model for packet routing.

Packet  routing  also  provides  an  important  bridge  between  theoretical  computer  science 
and  applied  computer  science;  it  is  through  packet  routing  that  a  real  machine  such  as  the 
Connection  Machine  is  able  to  simulate  an  idealized  machine  such  as  the  CRCW  PRAM.  More 
generally,  getting  the  right  data  to  the  right  place  at  the  right  time  is  an  important,  interesting, 
and  challenging  problem.  Not  surprisingly,  it  has  also  been  the  subject  of  a  great  deal  of 
research. 

1.1.1  Past  work 

The  first  major  result  in  packet  routing  is  due  to  Benes  [10]  who  showed  that  the  inputs 
and  outputs  of  a  Benes  network  can  be  connected  in  any  permutation  by  a  set  of  disjoint 
paths.  Waksman  [98]  then  gave  a  simple  off-line  algorithm  for  finding  the  paths  in  linear  time. 
Given  the  paths,  it  is  straightforward  to  route  a  permutation  of  packets  from  the  inputs  to  the 
outputs of an N-node Benes network in O(log N) steps using queues of size 1. Although the
inputs comprise only O(N/log N) nodes, it is possible to route any permutation of N packets
in O(log N) steps by pipelining O(log N) such permutations. Unfortunately, no efficient on-line
algorithm for finding the paths is known.

Shortly  thereafter,  Batcher  [9]  devised  an  elegant  and  practical  on-line  algorithm  for  sorting 
N packets on an N-node shuffle-exchange graph in log^2 N steps using queues of size 1. The
algorithm can be used to route any permutation of packets by sorting based on destination address.
The result extends to routing many-one problems provided that (as is typically assumed)
combining can be used to merge packets that have a common destination.

No better deterministic algorithm was found until Ajtai, Komlós, and Szemerédi [2] solved
a classic open problem by constructing an O(log N)-depth sorting network. Leighton [47] then
used this O(N log N)-node network to construct a degree-3 N-node network capable of solving
any N-packet routing problem in O(log N) steps using queues of size 1. Although this result is
optimal up to constant factors, the constant factors are quite large and the algorithm is of no
practical use. Hence, the effort to find fast deterministic algorithms has continued. Recently
Upfal discovered an O(log N)-step algorithm for routing on an expander-based network, called
the multibutterfly [95]. The algorithm solves the routing problem directly without reducing it
to sorting, and the constant factors are much smaller than those of the AKS-based algorithms.
In [52], we show that the multibutterfly is fault tolerant and improve the constant factors in
Upfal's algorithm.

There has also been great success in the development of efficient randomized packet routing
algorithms. The study of randomized algorithms was pioneered by Valiant and Brebner [97], who
showed how to route any permutation of N packets in O(log N) steps on an N-node hypercube
with queues of size O(log N) at each node. Although the algorithm is not always guaranteed
to work, it is guaranteed to work with probability at least 1 - 1/N for any permutation.
This result was improved in a succession of fundamental papers by Aleliunas [3], Upfal [94],
Pippenger [76], and Ranade [81]. Aleliunas and Upfal developed the notion of a delay path and
showed how to route on the shuffle-exchange and butterfly graphs (respectively) in O(log N)
steps with queues of size O(log N). Pippenger was the first to eliminate the need for large
queues, and showed how to route on a variant of the butterfly in O(log N) steps with queues
of size O(1). Ranade showed how combining could be used to extend the Pippenger result to
include many-one routing problems, and tremendously simplified the analysis required to prove
such a result. As a consequence of Ranade's work, it has finally become possible to simulate a
step of an N-processor CRCW PRAM on an N-node butterfly or hypercube in O(log N) steps
using constant-size queues on each edge.

Concurrent with the development of these hypercube-related packet routing algorithms has
been the development of algorithms for routing in meshes. The randomized algorithm of Valiant
and Brebner can be used to route any permutation of N packets on a √N × √N mesh in
O(√N) steps using queues of size O(log N). Kunde [43] showed how to route deterministically
in (2 + ε)√N steps using queues of size O(1/ε). Also, Krizanc, Rajasekaran, and Tsantilas [41]
showed how to randomly route any permutation in 2√N + O(log N) steps using constant-size
queues. Most recently, Leighton, Makedon, and Tollis discovered a deterministic algorithm for
routing any permutation in 2√N - 2 steps using constant-size queues [49], thus achieving the
optimal time bound in the worst case.

1.1.2  Our  approach 

One deficiency with the state of the art in packet routing is that aside from Valiant's paradigm
of “first routing to a random destination,” all of the algorithms and their analyses are very
specifically tied to the network on which the routing is to take place, as well as to the requirement
that packets are first routed to destinations that are (in some sense) random. For example, the
butterfly routing algorithms are all quite different from the mesh algorithms in the way that
queue size is kept constant. Moreover, the butterfly and hypercube algorithms are so specific to
those networks that no O(log N)-step constant-queue-size algorithm was known for the closely
related shuffle-exchange graph. The lack of a good routing algorithm for the shuffle-exchange
graph is one of the reasons that the butterfly is preferred to the shuffle-exchange graph in
practice.

In  this  chapter,  we  take  a  significant  step  towards  the  development  of  a  universal  approach 
to  packet  routing.  Our  approach  to  the  problem  differs  from  previous  approaches  in  that  we 
separate  the  process  of  selecting  packet  paths  from  the  process  of  timing  packet  movements 
along  the  paths.  More  precisely,  given  any  underlying  network,  and  any  selection  of  paths  for 
the  packets,  we  study  the  problem  of  timing  the  movement  of  the  packets  so  as  to  minimize  the 
total time and maximum queue size needed to route all the packets to their correct destinations.

Figure 1-2: A set of paths for the packets. Each packet follows a shortest path to its destination.
The dilation is d = 3 and the congestion is c = 3.

Of  course,  there  must  be  some  correlation  between  the  performance  of  the  algorithm  and 
the  selection  of  the  paths.  In  particular,  the  maximum  distance,  d,  traveled  by  any  packet 
is  always  a  lower  bound  on  the  time  required  to  route  all  packets.  We  call  this  distance  the 
dilation  of  the  paths.  Similarly,  the  largest  number  of  packets  that  must  traverse  a  single  edge 
during the entire course of the routing is a lower bound. We call this number the congestion,
c,  of  the  paths. 

Viewed  in  terms  of  these  parameters,  then,  a  routing  problem  can  be  broken  into  two  stages. 
In  Stage  1,  we  select  paths  for  the  packets  so  as  to  minimize  c  and  d.  In  Stage  2,  we  schedule 
the  movement  of  the  packets  so  as  to  minimize  the  total  time  and  maximum  queue  size.  To 
illustrate this two-stage approach, let us return to the routing problem of Figure 1-1.

Figure  1-2  shows  one  way  of  choosing  the  paths  for  the  packets.  Here,  each  packet  takes  a 
shortest  path  from  its  origin  to  its  destination.  For  example,  packet  1  follows  a  path  from  node 
3  to  2  to  4  to  1.  Since  no  packet  traverses  more  than  three  edges,  the  dilation  is  d  =  3.  Packets 
3, 4,  and  5  all  traverse  the  edge  from  1  to  2,  but  no  more  than  three  packets  traverse  any  other 
edge.  Thus,  the  congestion  is  c  =  3. 
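The two parameters are easy to compute once the paths are fixed. The following is a minimal sketch, assuming each path is given as a list of nodes and edges are directed. Only the path of packet 1 (3 to 2 to 4 to 1) is taken from the text; the other four paths are hypothetical stand-ins chosen so that the example reproduces d = 3 and c = 3, and they need not match Figure 1-2.

```python
from collections import Counter

def dilation(paths):
    """Number of edges on the longest path."""
    return max(len(nodes) - 1 for nodes in paths.values())

def congestion(paths):
    """Largest number of packets that traverse any single directed edge."""
    use = Counter()
    for nodes in paths.values():
        for edge in zip(nodes, nodes[1:]):
            use[edge] += 1
    return max(use.values())

# Packet number -> list of nodes visited.
# Packet 1 follows the path stated in the text; the rest are hypothetical.
paths = {
    1: [3, 2, 4, 1],
    2: [3, 2],
    3: [5, 1, 2, 4],
    4: [1, 2, 3],
    5: [1, 2, 4, 5],
}

print("dilation d =", dilation(paths))
print("congestion c =", congestion(paths))
```

Both quantities are lower bounds on the length of any schedule for these paths, which is why the later sections state their bounds in terms of c and d alone.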

A schedule for the packets is displayed in Figure 1-3. A schedule simply specifies which
packets move and which wait at each time step. An X in row p and column t indicates that
at time t packet p traverses an edge and enters the queue at the end of that edge. A blank
indicates that at time t packet p waits in a queue. For example, packet 3 moves at time step 1,
waits at steps 2 and 3, and then moves again in steps 4 and 5.

[Figure 1-3 grid: one row per packet, one column for each of time steps 1 through 5.]
Figure 1-3: A schedule for the packets. An X in row p and column t indicates that at time t
packet p moves. A blank indicates that it waits.

The  step-by-step  progress  of  the  packets  as  they  follow  the  paths  from  Figure  1-2  according 
to  the  schedule  of  Figure  1-3  is  illustrated  in  Figure  1-4. 

Part  (a)  shows  the  packets  in  their  initial  queues  before  the  routing  begins.  In  the  first  step, 
packet  1  takes  the  edge  from  node  3  to  node  2,  3  takes  the  edge  from  5  to  1,  and  4  takes  the 
edge  from  1  to  2.  Packets  2  and  5  must  wait  because  the  first  edges  on  their  paths  are  taken 
by  packets  1  and  4,  respectively. 

The  positions  of  the  packets  at  the  end  of  time  step  1  are  shown  in  part  (b).  In  step  2, 
packets  1,  2,  and  5  move,  while  packets  3  and  4  wait.  Packet  2  reaches  its  destination,  is 
removed  from  the  queue  at  the  end  of  the  edge  from  3  to  2,  and  enters  the  final  queue  for  node 
2. 

At the end of step 2, the packets are positioned as shown in part (c). Note that packet 2,
which resides in the final queue for node 2, is not pictured. In step 3, packets 1 and 4 move,
but packet 3 must wait because the queue that it wishes to enter is full at the beginning of the
step.

Figure 1-4: The step-by-step progress of the packets. The positions of the packets at the ends
of steps 0 through 4 are shown in parts (a) through (e), respectively.

After step 3, only packets 3 and 5 remain en route. Both packets move in step 4, and reach
their destinations in step 5. Their positions at the ends of steps 3 and 4 are shown in parts (d)
and (e), respectively.
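The walkthrough above can be checked mechanically. The sketch below validates a schedule against the one rule used throughout this chapter, namely that at most one packet traverses an edge at each step. The move times are the ones described in the text (for example, packet 3 moves at steps 1, 4, and 5). The paths, apart from that of packet 1, are hypothetical, and queue capacities are not modeled.

```python
def check_schedule(paths, moves):
    """Return the schedule length, or raise ValueError if the schedule is invalid.

    paths: packet -> list of nodes on its path.
    moves: packet -> set of time steps (1-based) at which the packet
           traverses the next edge on its path; at all other steps it waits.
    """
    used = set()   # (edge, step) pairs already claimed by some packet
    last = 0
    for packet, nodes in paths.items():
        steps = sorted(moves[packet])
        if len(steps) != len(nodes) - 1:
            raise ValueError(f"packet {packet} makes the wrong number of moves")
        for edge, t in zip(zip(nodes, nodes[1:]), steps):
            if (edge, t) in used:
                raise ValueError(f"edge {edge} is used twice at step {t}")
            used.add((edge, t))
        if steps:
            last = max(last, steps[-1])
    return last

# Packet 1's path is from the text; the other paths are hypothetical.
paths = {1: [3, 2, 4, 1], 2: [3, 2], 3: [5, 1, 2, 4], 4: [1, 2, 3], 5: [1, 2, 4, 5]}
# Move times as described in the walkthrough of Figure 1-3.
moves = {1: {1, 2, 3}, 2: {2}, 3: {1, 4, 5}, 4: {1, 3}, 5: {2, 4, 5}}

print("schedule length:", check_schedule(paths, moves))
```

Letting packet 2 move at step 1 instead of step 2 makes the check fail, because packet 1 already uses the edge from node 3 to node 2 at that step.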

For many networks, Stage 1 is easy. We simply use Valiant's paradigm of first routing to a
random destination, and then routing to the correct destination. It is easily shown for meshes,
butterflies, shuffle-exchange graphs, etc., that this approach yields values of c and d that are
within a small constant factor of the diameter of the network, which is as well as can be done.
Moreover, this technique also usually works for many-one problems provided that the address
space is randomly hashed.
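As an illustration of Stage 1, the sketch below selects a path by Valiant's paradigm on an n-dimensional hypercube. The choice of the hypercube and of bit-fixing paths (correcting differing address bits from lowest to highest) is an assumption made for the example; the text applies the paradigm to meshes, butterflies, and shuffle-exchange graphs as well, where the path construction differs.

```python
import random

def bit_fixing_path(src, dst, n):
    """Nodes visited when the differing bits are corrected from bit 0 to bit n-1."""
    path, cur = [src], src
    for i in range(n):
        if ((cur ^ dst) >> i) & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def valiant_path(src, dst, n, rng):
    """Route to a uniformly random intermediate node, then on to dst."""
    mid = rng.randrange(1 << n)
    first = bit_fixing_path(src, mid, n)
    second = bit_fixing_path(mid, dst, n)
    return first + second[1:]      # do not repeat the intermediate node

rng = random.Random(0)
print(valiant_path(0b0000, 0b1111, 4, rng))
```

Each phase crosses at most n edges, so the dilation of the selected paths is at most 2n, which is within a factor of two of the diameter. Note that a two-phase path may cross the same edge twice, so paths chosen this way are not necessarily edge-simple.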

Stage 2 has traditionally been the hard part of routing. Curiously, however, we have found
that by ignoring the underlying network and the method of path selection, Stage 2 actually
becomes easier to solve! Hence we will be able to obtain results for routing that are both simpler
and far more general than existing approaches. Among other things, we will be able to route
on the N-node mesh in O(√N) steps using constant-size queues with the same algorithm that
uses O(log N) steps and constant-size queues on the butterfly. We will also be able to route on
the shuffle-exchange graph in O(log N) steps with constant-size queues. Also, by showing how
to route efficiently on a fat-tree, we provide the first examples of volume and area-universal
networks that require only O(log N) slowdown.

1.1.3  Outline  of  the  results 

Our most difficult result is a proof that any set of packets whose paths have congestion c and
dilation d can be scheduled so as to complete the routing in O(c + d) steps using constant-size
queues. This result is optimal up to constant factors, and substantially improves the naive
bound of O(cd) steps and O(c)-size queues. Unfortunately, the result is highly nonconstructive,
and therefore is useful only if substantial amounts of off-line computation are available for the
routing. On the other hand, the result is robust in the sense that it provides a near-optimal
schedule of packet movements for any set of paths and any underlying network. Such robustness
is particularly useful when dealing with routing problems on arbitrary distributed networks as
in [54]. The proof of the result is contained in Section 1.2.

We do not know whether or not there is an on-line algorithm that can route any set of paths
in O(c + d) steps with constant-size queues. It is not difficult to devise a randomized on-line
algorithm to schedule any set of N paths in O(c + d log N) steps using queues of size O(log N).
In special cases, however, we can do better. For example, a slight variant of Ranade's algorithm
can be used to schedule on-line any set of N paths on a bounded-degree layered network in
O(c + d + log N) steps using constant-size queues. By a layered network, we mean a network
in which each edge connects a level i node to a level i + 1 node, where the level numbers range
from 0 to d. For example, the butterfly is layered in this fashion. The algorithm is randomized,
but requires only O(log^2 N) bits of randomness to succeed with high probability. The proof of
this result is included in Section 1.3. Curiously, the proof is simpler than the previous proof of
the same result applied specifically to routing random paths in butterflies [81]. (The fact that
Ranade's algorithm can be used in this general context has also been observed by Ranade [82].)

The on-line algorithm for layered networks can immediately be applied to obtain good
routing algorithms for meshes and butterflies. With some extra effort, it can be applied to
obtain the first algorithm for routing kN packets on an N-node k-dimensional array with
maximum side length M in O(kM) steps using constant-size queues, and for routing N packets on
an N-node shuffle-exchange graph in O(log N) steps using constant-size queues. It can also be
applied to construct a class of networks that are area universal in the sense that the network in
the class with N processors has area O(N), and can, with high probability, simulate in O(log N)
steps each step of any other network of area O(N). An analogous result is shown for a class
of volume-universal networks. The routing algorithm is used as a subroutine in algorithms
for sorting on butterflies and multidimensional arrays. The details of these applications are
included in Sections 1.4 through 1.9.

This thesis leaves open the question of whether or not there is an on-line algorithm that
can schedule any set of paths in O(c + d) steps using constant-size queues. We suspect that
finding such an algorithm (if one exists) will be a challenging task. Our negative suspicions
are derived from the fact that we can construct counterexamples to most of the simplest on-line
algorithms. In other words, for several natural on-line algorithms (including the algorithm
described in Section 1.3) we can find packet paths for which the algorithm will construct a
schedule using substantially more than Ω(c + d + log N) steps. Several of the counterexamples
are included in Section 1.10.


1.2  The  existence  of  asymptotically  optimal  schedules 

The main result in this section is a proof that for any set of packets whose paths are
edge-simple¹ and have congestion c and dilation d, there is a schedule of length O(c + d) in which
at most one packet traverses each edge of the network at each step, and at most O(1) packets
wait in each queue at each step. Note that there are no restrictions on the size, topology, or
degree of the network or on the number of packets.

Our strategy for constructing an efficient schedule is to make a succession of refinements to
the “greedy” schedule, S_0, in which each packet moves at every step until it reaches its final
destination. This schedule is as short as possible; its length is only d. Unfortunately, as many as
c packets may use an edge at a single time step in S_0, whereas in the final schedule at most one
packet is allowed to use an edge at each step. Each refinement will bring us closer to meeting
this requirement by bounding the congestion within smaller and smaller frames of time.

The proof uses the Lovász local lemma [89, pp. 57-58] at each refinement step. Given a
set of “bad” events in a probability space, the lemma provides a simple inequality which, when
satisfied, guarantees that with probability greater than zero, no bad event occurs. The inequality
relates the probability that each bad event occurs to the dependence among them. A set of
bad events A_1, ..., A_m in a probability space has dependence at most b if every event is mutually
independent of some set of m - b other bad events. The lemma is nonconstructive; for a discrete
probability space it proves that there is some elementary outcome that is not in any bad event,
but does not specify that outcome.

Lemma 1 (Lovász) Let A_1, ..., A_m be a set of “bad” events, each occurring with probability
p, with dependence at most b. If 4pb < 1, then with probability greater than zero, no bad event
occurs. □


¹An edge-simple path uses no edge more than once.

1.2.1 A preliminary result

Before proving the main result of this section, we show that there is a schedule of length
(c + d) 2^{O(log*(c + d))} that uses queues of size log(c + d) 2^{O(log*(c + d))}. This preliminary result is
substantially simpler to prove because of the relaxed bounds on the schedule length and queue
size. Nevertheless, it illustrates the basic ideas necessary to prove the main result.

Theorem 2 For any set of packets whose paths are edge-simple and have congestion c and
dilation d, there is a schedule in which at most one packet traverses each edge at each step with
length (c + d) 2^{O(log*(c + d))} and maximum queue size log(c + d) 2^{O(log*(c + d))}.


Proof: For simplicity, we shall assume without loss of generality that c = d, so that the bounds
on the length and queue size are d 2^{O(log* d)} and (log d) 2^{O(log* d)}, respectively.

The proof has the following outline. The first step is to assign each packet a delay chosen
randomly, independently, and uniformly from the range [1, αd], where α is a fixed constant that
will be determined later. In the resulting schedule, S_1, a packet assigned a delay of x waits
in its initial queue for x steps, then moves on to its destination without waiting again until it
enters its final queue. The length of S_1 is at most (1 + α)d. Next we break the schedule into
(1 + α)d / log d sets of log d consecutive time steps, as shown in Figure 1-5. Each of these sets is
called a log d-frame. We use the Lovász local lemma to show that there is some way of choosing
the initial delays so that in each of these log d-frames at most log d packets pass through any
edge. Finally, we view each log d-frame as a routing problem with dilation log d and congestion
log d, and solve it recursively.
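The first step of the outline can be tried numerically. The sketch below draws one random assignment of initial delays and reports the largest number of packets that cross any edge within a single log d-frame of the resulting schedule S_1. It is only an experiment under stated assumptions: logarithms are taken base 2, α is an integer, frames are aligned to multiples of the frame size, and the function name is our own. The proof itself does not sample delays; it uses the Lovász local lemma to show that a good assignment exists.

```python
import math
import random
from collections import Counter

def frame_congestion_after_delays(paths, alpha, rng):
    """Assign each packet a uniform delay in [1, alpha*d]; return (worst, frame).

    paths: list of node lists (one per packet), assumed edge-simple.
    worst: largest number of packets crossing one edge within one frame.
    frame: the frame size, floor(log2 d), at least 1.
    """
    d = max(len(nodes) - 1 for nodes in paths)
    frame = max(1, int(math.log2(d)))
    use = Counter()
    for nodes in paths:
        delay = rng.randint(1, alpha * d)
        for hop, edge in enumerate(zip(nodes, nodes[1:])):
            t = delay + hop + 1                  # step at which this edge is crossed
            use[(edge, (t - 1) // frame)] += 1   # index of the frame containing step t
    return max(use.values()), frame

# Hypothetical worst case for the greedy schedule: 16 packets share one 16-edge path.
paths = [list(range(17)) for _ in range(16)]
worst, frame = frame_congestion_after_delays(paths, alpha=8, rng=random.Random(0))
print(f"frame size {frame}, worst frame congestion {worst} (c = 16 without delays)")
```

In the greedy schedule all 16 packets would cross every edge in the same step, so any value of `worst` well below 16 shows the spreading effect of the delays.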

To apply the Lovász local lemma, we associate a bad event with each edge. The bad event
for edge e is that more than log d packets use e in any log d-frame. To show that there is a
way of choosing the delays so that no bad event occurs, we need to bound the dependence, b,
among the bad events and the probability, p, of each individual bad event occurring.

The dependence calculation is straightforward. Whether or not a bad event occurs depends
solely on the delays assigned to the packets that pass through the corresponding edge. Thus,
two bad events are independent unless some packet passes through both of the corresponding
edges. Since at most c = d packets pass through an edge, and each of these packets passes
through at most d other edges, the dependence, b, of the bad events is at most cd = d^2.

Figure 1-5: Schedule S_1. The schedule is derived from the greedy schedule, S_0, by assigning
each packet a random initial delay in the range [1, αd]. We use the Lovász local lemma to show
that within each log d-frame, at most log d packets pass through each edge.

Computing the probability of each bad event is a little trickier. Let p be the probability of
the bad event corresponding to edge e. Then

$$p \le \frac{(1+\alpha)d}{\log d} \binom{d}{\log d} \left(\frac{\log d}{\alpha d}\right)^{\log d}.$$

This expression is derived as follows. There are (1 + α)d / log d different log d-frames, and we
bound p by summing over all frames the probability that more than log d packets pass through e
in the frame. The number of packets passing through e in the frame has a binomial distribution.
There are d independent Bernoulli trials, one for each packet that uses e. Since at most log d
of the possible αd delays will actually send a packet through e in the frame, each trial succeeds
with probability log d / αd. (Here we use the assumption that the paths are edge-simple.) The
probability of more than log d successes is at most $\binom{d}{\log d} (\log d / \alpha d)^{\log d}$.

For sufficiently large α, the product 4pb is less than 1, and thus, by the Lovász local
lemma, there is some assignment of delays such that at most log d packets use any edge in any
log d-frame.
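To see how large α has to be in a concrete case, the bound on p and the dependence bound b = d^2 can be evaluated directly. The sketch below is an illustration under assumptions of our own: logarithms are base 2, d is a power of two so that log d is an integer, and the function name `lll_product` is hypothetical. It uses floating-point arithmetic and is meant for moderate values of d only.

```python
import math

def lll_product(d, alpha):
    """Upper bound on 4*p*b for c = d, using the bounds derived in the text."""
    k = int(math.log2(d))                               # frame size log d
    frames = (1 + alpha) * d / k                        # number of log d-frames
    tail = math.comb(d, k) * (k / (alpha * d)) ** k     # bound for one frame
    p = frames * tail                                   # union bound over frames
    b = d * d                                           # dependence, at most cd = d^2
    return 4 * p * b

for alpha in (1, 10, 100):
    value = lll_product(1024, alpha)
    verdict = "satisfies" if value < 1 else "violates"
    print(f"alpha = {alpha:>3}: 4pb <= {value:.3g}, which {verdict} 4pb < 1")
```

Since $\binom{d}{k} \le (ed/k)^k$, the single-frame term is at most (e/α)^{log d}, so once α exceeds e the bound on p falls polynomially in d, and a large enough constant α overcomes the d^2 factor contributed by b.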

Each log d-frame can be viewed as a separate scheduling problem where the origin of a
packet is its location at the beginning of the frame, and its destination is its location at the
end of the frame. If at most log d packets use each edge in a log d-frame, then the congestion of
the problem is log d. The dilation is also log d because in log d time steps a packet can move a
distance of at most log d. In order to schedule each frame independently, a packet that arrives
at its destination before the last step in the rescheduled frame is forced to wait there until the
next frame begins.

All that remains is to bound the length of the schedule and the size of the queues. The
recursion proceeds to a depth of O(log* d), at which point the frames have size O(1), and at
most O(1) packets use each edge in each frame. The resulting schedule can be converted to one
in which at most one packet uses each edge in each time step by slowing it down by a constant
factor. The length of the final schedule is d 2^{O(log* d)}. The bound on the queue size follows from
the observation that no packet waits at any one spot (other than its origin or destination) for
more than (log d) 2^{O(log* d)} consecutive time steps, and in the final schedule at most one packet
traverses each edge at each time step. □
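The recursion depth in this argument is the iterated logarithm of d, because each level replaces a problem of size d by problems of size log d. A small sketch, assuming base-2 logarithms and the convention that log* x counts the applications of log needed to bring x down to 1 or below:

```python
import math

def log_star(x):
    """Number of times log2 must be applied to x before the value is at most 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

for d in (2, 16, 65536, 2 ** 65536):
    label = d if d < 10 ** 6 else "2^65536"
    print(f"log*({label}) = {log_star(d)}")
```

Each level of the recursion lengthens the schedule by a constant factor (at most 1 + α for the delays), so a depth of O(log* d) accounts for the 2^{O(log* d)} factor in the bounds of Theorem 2.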

1.2.2  The  main  result 

Proving that there is a schedule of length O(c + d) using constant-size queues is more difficult.
Removing the 2^{O(log*(c + d))} factor in the length of the schedule seems to require delving into
second-order terms in the probability calculations, and reducing the queue size to O(1) mandates
greater care in spreading delays out over the schedule.

Before proceeding, we need to introduce some notation. The frame congestion, C, in a
T-frame is the largest number of packets that traverse any edge in the frame. The relative
congestion, R, in a T-frame is the ratio C/T of the congestion in the frame to the size of the
frame.
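These definitions translate directly into a computation. The sketch below returns the largest relative congestion over all T-frames of a schedule, assuming the schedule is summarized by listing, for each edge, the time steps at which packets traverse it (a step appears more than once if several packets cross the edge in that step, as can happen in S_0). The input format and function name are assumptions made for the example.

```python
def relative_congestion(edge_times, T):
    """Max over edges and over all windows of T consecutive steps of C/T.

    edge_times: edge -> list of time steps at which some packet traverses it.
    """
    worst = 0
    for times in edge_times.values():
        times = sorted(times)
        lo = 0
        for hi, t in enumerate(times):
            # Shrink the window so that it covers steps t-T+1 .. t only.
            while times[lo] <= t - T:
                lo += 1
            worst = max(worst, hi - lo + 1)
    return worst / T

# Hypothetical usage data: edge "e" is crossed twice at step 1, then at steps 2 and 5.
edge_times = {"e": [1, 1, 2, 5], "f": [3]}
for T in (1, 2, 5):
    print(f"T = {T}: relative congestion {relative_congestion(edge_times, T)}")
```

A schedule in which at most one packet crosses each edge per step has relative congestion at most 1 in every frame, which is the target that the refinements below approach.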

Theorem 3 For any set of packets whose paths are edge-simple and have congestion c and
dilation d, there is a schedule in which at most one packet traverses each edge of the network
at each step with length O(c + d) and maximum queue size O(1).

Proof: To make the proof more modular, bounds on frame size and relative congestion after
each step in the construction are stated as lemmas. These lemmas and their proofs are included
within the proof of the theorem. We assume without loss of generality that c = d, so that the
bound on the length of the schedule is O(d).

As before, the strategy is to make a succession of refinements to the greedy schedule, S_0. The
first refinement is special. It transforms S_0 into a schedule S_1 in which the relative congestion
in each log d-frame is at most O(1). Thereafter, each refinement transforms a schedule S_i with
relative congestion at most r^(i) in any frame of size I^(i) or greater into a schedule S_{i+1} with
relative congestion at most r^(i+1) in any frame of size I^(i+1) or greater, where r^(i+1) ≈ r^(i) and
I^(i+1) ≪ I^(i), as shown in Figure 1-6. As we shall see, after j refinements, where j = O(log* d),
we obtain a schedule S_j with relative congestion O(1) in every frame of size k_0 or greater, where
k_0 is some constant. From S_j it is straightforward to construct a schedule of length O(c + d)
in which at most one packet traverses each edge of the network at each step, and at most O(1)
packets wait in each queue at each step.

Figure 1-6: A refinement step. Each refinement transforms a schedule S_i into a slightly longer
schedule S_{i+1}. The frame size is greatly reduced in S_{i+1}, yet the relative congestion within a
frame remains about the same, i.e., I^(i+1) ≪ I^(i) and r^(i+1) ≈ r^(i).

At the start, the relative congestion in a d-frame of S_0 is at most 1. We begin by assigning
each packet a random delay chosen uniformly from 1 to d at the beginning of the greedy schedule
S_0. Using the Lovász local lemma, it is possible to show that there is some way of choosing the
delays so that in the resulting schedule S_1, the relative congestion is at most r^(1) = O(1) in any
frame of size I^(1) = log d or greater.

Next, we repeatedly refine the schedule to reduce the frame size. As we shall see, the relative
congestion r^(i+1) and frame size I^(i+1) for schedule S_{i+1} are given by the recurrences

$$r^{(i+1)} = \begin{cases} O(1) & i = 0 \\ r^{(i)}\left(1 + O(1)/\sqrt{\log I^{(i)}}\right) & i > 0 \end{cases}$$

and

$$I^{(i+1)} = \begin{cases} \log d & i = 0 \\ \log^4 I^{(i)} & i > 0 \end{cases}$$

which have solutions I^(j) = O(1) and r^(j) = O(1) for some j, where j = O(log* d).

We have not explicitly defined the values of r^(j) and I^(j) for which the recursion terminates.
However, in several places in the proof that follows we implicitly use the fact that I^(i) is
sufficiently large or r^(i) is sufficiently small that some inequality holds. The recursion terminates
when the first of these inequalities fails to hold. When this happens, one of r^(i) or I^(i) is O(1),
which implies that the other is also.

An important invariant that we maintain throughout the construction is that in schedule
S_{i+1} every packet waits at most once every I^(i) steps. As a consequence, a packet waits at
most once every O(1) steps in S_j, which implies both that the queues in S_j cannot grow larger
than O(1) and that the total length of S_j is O(d). Schedule S_j almost satisfies the requirement
that at most one packet traverse each edge in each step. By simulating each step of S_j in O(1)
steps we can meet this requirement with only a factor of 2 increase in the queue size and a
factor of O(1) increase in the running time.

The rest of the proof describes a refinement step in detail. For ease of notation, we use I
and r in place of I^(i) and r^(i).

The first step in the ith refinement is to break schedule S_i into blocks of 2I^3 + 2I^2 - I
consecutive time steps. Each block is rescheduled independently.

For each block, each packet is assigned a random delay chosen independently and uniformly
from 1 to I. A packet assigned a delay of x must wait for x steps at the beginning of the block.
In order to maintain the invariant that in schedule S_{i+1} every packet waits at most once every
I^(i) steps, the packet is not delayed for x consecutive steps at the beginning of the block, but
instead a delay is inserted every I steps in the first xI steps of the block.² A packet that is
delayed x steps reaches its destination at the end of the block by step 2I^3 + 2I^2 - I + x. Since
some packet may have delay x = I, the rescheduled block must have length 2I^3 + 2I^2.


²Before the delays for schedule S_{i+1} have been inserted, a packet is delayed at most once in each block of S_i.
Prior to inserting each new delay into a block, we check if it is within I^(i) steps of the single old delay. If the
new delay would be too close to the old delay, then it is simply not inserted. The loss of a single delay in a block
has a negligible effect on the probability calculations in the lemmas that follow.

Figure 1-7: Bounds on frame size and relative congestion after inserting delays into S_i. Here
I_1 = log^2 I and r_1 = r(1 + O(1)/√(log I)).

In order to independently reschedule the next block, the packets must reside in exactly the
same queues at the end of the rescheduled block that they did at the end of the block of S_i.
Since some packets arrive early, they must be slowed down. Thus, if a packet is assigned delay
x, then a delay is inserted every I steps in the last I(I - x) steps of the block. Note that at the
beginning of the first block and end of the last block, it is not necessary to separate the delays
by I steps, because the packets reside in their initial and final queues, respectively.
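The bookkeeping for one packet in one rescheduled block can be sketched as follows. The text fixes how many waits are inserted and in which regions of the block; the exact positions used below (the last step of each group of I steps) are an assumption made for the example, and the rule from the footnote about dropping a delay that falls too close to an old one is not modeled.

```python
def wait_steps(I, x):
    """Steps (1-based, in the rescheduled block) at which a packet with delay x waits.

    Returns (waits, length), where length is the rescheduled block length.
    """
    length = 2 * I**3 + 2 * I**2
    # x waits, one every I steps, within the first x*I steps of the block.
    early = [k * I for k in range(1, x + 1)]
    # I - x waits, one every I steps, within the last I*(I - x) steps of the block.
    start = length - I * (I - x)
    late = [start + k * I for k in range(1, I - x + 1)]
    return early + late, length

I = 4
for x in range(1, I + 1):
    waits, length = wait_steps(I, x)
    print(f"x = {x}: waits at {waits}, moves at the other {length - len(waits)} steps")
```

Every packet receives exactly I inserted waits regardless of x, so each packet makes 2I^3 + 2I^2 - I moves in a block of length 2I^3 + 2I^2 and ends in the same queue as in S_i, while consecutive inserted waits are at least I steps apart.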

Lemmas 4 through 6 bound the frame size and relative congestion in various parts of the
block after the delays are inserted into S_i. The bounds are shown in Figure 1-7. Inserting
delays may increase the relative congestion in the I^2 steps at the beginning and end of each
block. Lemma 4 shows that by increasing the frame size from I to I^2 we can bound the relative
congestion in these regions by r(1 + 1/I). Lemma 6 shows that between the first and last
I^2 steps we can decrease the frame size from I to log^2 I, while only increasing the relative
congestion in each frame from r to r(1 + O(1)/√(log I)). The proof of Lemma 6 uses Lemma 5
to bound the relative congestion over a wide range of frame sizes.

Lemma 4 For any choice of delays, the relative congestion in any frame of size I^2 or greater
after the delays are inserted is at most r(1 + 1/I).

Proof: After the delays are inserted, a packet can use an edge in a T-frame only if it used the edge
in the frame or in any of the I steps before the frame in S_i. Thus, at most r(T + I) packets
can use an edge in the T-frame. For T ≥ I^2, the relative congestion r(T + I)/T is at most r(1 + 1/I). □
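The arithmetic behind Lemma 4 can be checked numerically. The following is a minimal sketch, not part of the proof: the function names and the sample values of r and I are illustrative assumptions, and exact rational arithmetic is used so that the comparison at T = I^2 is not affected by rounding.

```python
from fractions import Fraction

def frame_bound(r, I, T):
    """Bound r*(T + I)/T on the relative congestion of a T-frame once
    delays of up to I steps have been inserted (exact rational arithmetic)."""
    return Fraction(r) * (T + I) / T

def lemma4_holds(r, I, extra=50):
    """Check frame_bound <= r*(1 + 1/I) for T = I^2, ..., I^2 + extra."""
    limit = Fraction(r) * (1 + Fraction(1, I))
    return all(frame_bound(r, I, T) <= limit
               for T in range(I * I, I * I + extra + 1))
```

The bound is tight at T = I^2 and fails for any smaller frame, which is why the frame size must grow from I to I^2 in these regions.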


Lemma 5 In any schedule, if the relative congestion in every frame of size T to 2T − 1 is at
most R then the relative congestion in any frame of size T or greater is at most R.

Proof: Consider a frame of size T', where T' > 2T − 1. The first (⌊T'/T⌋ − 1)T steps of the
frame can be broken into T-frames, each with relative congestion at most R. The remainder of the
T'-frame consists of a single frame of size between T and 2T − 1 steps in which the relative
congestion is also at most R. □
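The decomposition used in the proof of Lemma 5 can be written out directly. This is a small sketch under the assumption that frame sizes are positive integers; the name split_frame is hypothetical.

```python
def split_frame(t_prime, t):
    """Split a frame of t_prime >= t steps into consecutive sub-frames:
    (floor(t_prime / t) - 1) frames of exactly t steps, followed by one
    frame whose size lies between t and 2*t - 1."""
    if t < 1 or t_prime < t:
        raise ValueError("need 1 <= t <= t_prime")
    full = t_prime // t - 1            # number of leading T-frames
    sizes = [t] * full
    sizes.append(t_prime - full * t)   # equals t + (t_prime mod t)
    return sizes
```

Since every piece has size between T and 2T − 1, each edge is used by at most R times the piece length, and summing over the pieces gives at most R·T' uses in the whole frame.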

Lemma 6 There is some way of choosing the packet delays so that in between the first and last
I^2 steps of a block, the relative congestion in any frame of size I_1 = log^2 I or greater is at most
r_1 = r(1 + ε_1), where ε_1 = O(1)/√(log I).

Proof: With each edge we associate a bad event. For edge e, a bad event occurs when more
than r_1 T packets use e in any T-frame for T in the range I_1 to 2I_1 − 1. To show that no bad
event occurs, we need to bound both the dependence of the bad events and the probability that
an individual bad event occurs.

We first bound the dependence. At most r(2I^3 + 2I^2 − I) packets use an edge in the block.³
Each of these packets travels through at most 2I^3 + 2I^2 − I other edges in the block. As we shall
see later, it will always be true that r = r^(i) = O(1). Thus a bad event depends on b = O(I^6)
other bad events.

Now let us compute an upper bound on the probability, p_1, that more than r_1 I_1 packets
use an edge e in a particular I_1-frame. Since a packet may be delayed up to I steps before the
frame, any packet that uses e in the frame or in any of the I steps before the frame in S_i may
use e after the delays are inserted into S_i. Thus, there are at most r(I + I_1) packets that can
use e in the frame. For each of these the probability that the packet uses e in the frame after
being delayed is at most I_1/I. If we assume that no packet uses an edge more than once, then
these probabilities are independent. Thus, the probability p_1 that more than r_1 I_1 packets use
e in the frame is at most

    p_1 \le \sum_{k = r_1 I_1}^{r(I + I_1)} \binom{r(I + I_1)}{k} \left(\frac{I_1}{I}\right)^{k} \left(1 - \frac{I_1}{I}\right)^{r(I + I_1) - k}.

³Throughout the following lemmas we make references to quantities such as rI packets or log^4 I time steps,
when in fact rI and log^4 I may not be integral. Rounding these quantities to integer values when necessary does
not affect the correctness of the proof. For ease of exposition, we shall henceforth cease to consider the issue.

Let r_1 = r(1 + ε_1). We bound the series as follows. There are at most r(I + I_1) terms, and the
largest of these occurs for k = r_1 I_1. We bound this term by applying the inequalities
(1 + x) ≤ e^x, ln(1 + x) ≥ x − x^2/2 for 0 ≤ x ≤ 1, and \binom{a}{b} ≤ (ae/b)^b for 0 < b < a.

For I_1 = log^2 I and ε_1 = k_1/√(log I), we can ensure that p_1 ≤ 1/I^{k_2}, for any constant k_2 > 0, by
making the constant k_1 large enough.

Next we need to bound the probability p_2 that more than r_1 I_1 packets use e in any I_1-frame
of the block. There are at most O(I^3) I_1-frames. Thus p_2 ≤ O(I^3) p_1. By making the constant
k_2 large enough, we can ensure that p_2 ≤ 1/I^{k_3}, for any constant k_3 > 0.

The calculations for frames of size I_1 + 1 through 2I_1 − 1 are similar. There are at most
O(I^3) frames of any one size, and I_1 frame sizes between I_1 and 2I_1 − 1. By adjusting the
constants as before, we can guarantee that the probability p that more than r_1 T packets use e
in any T-frame for T between I_1 and 2I_1 − 1 is at most 1/I^{k_4} for any constant k_4 > 0.

Finally, since a bad event depends on only b = O(I^6) other bad events, we can make 4pb < 1
by making k_4 large enough. By the Lovász local lemma, there is some way of choosing the packet
delays so that no bad event occurs. □

Although the frame size in the center of each block has decreased, it has increased from
I to I^2 in the first and last I^2 steps of the block. Before decreasing the frame size in these
regions, we move the block boundaries to the centers of the blocks, as shown in Figure 1-8. Now
each block of size 2I^3 + 2I^2 has a "fuzzy" region of size 2I^2 in its center in which the relative
congestion in any frame of size I^2 or greater is r(1 + 1/I). In the I^3 steps before and after
the fuzzy region, the relative congestion in any frame of size I_1 or greater is r_1. To reduce the
frame size in the fuzzy region, we assign a random delay from 1 to I^2 to each packet. A packet
with delay x waits once every I^3/x steps in the I^3 steps before the fuzzy region and once every
I^3/(I^2 − x) steps in the I^3 steps after the region. The rescheduled block now has size 2I^3 + 3I^2.

We now show that there is some way of inserting delays into the schedule before the fuzzy
region that both reduces the frame size in the fuzzy region, and does not increase either the
frame size or the relative congestion before the fuzzy region by much. A similar analysis holds
after the fuzzy region.

Figure 1-8: A block after recentering. The "fuzzy region" in the center of the block is shaded.

Lemma 7 There is some way of choosing the packet delays so that between steps I log^3 I and
I^3, the relative congestion in any frame of size I_1 or greater is at most r_2 = r(1 + ε_2),
where ε_2 = O(1)/√(log I), and so that in the fuzzy region the relative congestion in any frame of
size I_1 or greater is at most r_3 = r(1 + ε_3), where ε_3 = O(1)/√(log I).

Proof: Since no delays are inserted into the fuzzy region, the proof that the frame size has
been reduced in the fuzzy region is analogous to the proof of the previous lemma.

Before the fuzzy region, the situation is more complex. By the kth step, 0 ≤ k ≤ I^3, a
packet with delay x has waited xk/I^3 times. Thus, the delay of a packet at the kth step varies
essentially uniformly from 0 to u = k/I. For u ≥ log^3 I, or equivalently, k ≥ I log^3 I, we can
show that the relative congestion in any frame of size I_1 or greater has not increased much.

The proof uses the Lovász local lemma as before. The calculation for the dependence is
unchanged. The probability p_2 that more than r_2 I_1 packets use an edge e in a particular
I_1-frame satisfies

    p_2 \le \sum_{k = r_2 I_1}^{r_1(I_1 + u)} \binom{r_1(I_1 + u)}{k} \left(\frac{I_1}{u}\right)^{k} \left(1 - \frac{I_1}{u}\right)^{r_1(I_1 + u) - k}.

Using  the  same  inequalities  as  before,  we  have 
P2  <  0(rj(li  + 

For I_1 = log^2 I and u ≥ log^3 I, it suffices that ε_2 = O(1)/√(log I). □

For steps 0 to I log^3 I, we use the following lemma to bound the frame size and relative
congestion.

Figure 1-9: Final bounds on frame size and relative congestion. To reduce the frame size in the
fuzzy regions, delays are inserted only outside the shaded region. Here I_1 = log^2 I, I_2 = log^4 I,
r_2 = r(1 + O(1)/√(log I)), r_3 = r(1 + O(1)/√(log I)), and r_4 = r_1(1 + 1/log I) = r(1 + O(1)/√(log I)).

Lemma 8 The relative congestion in any frame of size I_2 or greater between steps 0 and I log^3 I
is at most r_4, where I_2 = log^4 I and r_4 = r_1(1 + 1/log I).

Proof: The proof is similar to that of Lemma 4. □

We have now completed our transformation of schedule S_i into schedule S_{i+1}. Let us review
the relative congestion and frame sizes in the different parts of a block. Between steps 0 and
I log^3 I, the relative congestion in any frame of size I_2 or greater is at most r_4. Between this
region and the fuzzy region, the relative congestion in any frame of size I_1 or greater is at most
r_2. In the fuzzy region, the relative congestion in any frame of size I_1 or greater is at most r_3.
After the fuzzy region, the relative congestion in any frame of size I_1 or greater is again r_2,
until step 2I^3 + 3I^2 − I log^3 I, where the relative congestion in any frame of size I_2 or greater is
r_4. These bounds are shown in Figure 1-9. For the entire block it is safe to say that the relative
congestion in any frame of size I^(i+1) = log^4 I or greater is at most r^(i+1) = r(1 + O(1)/√(log I)).
□


1.3  On-line  algorithms 

1.3.1 An O(c + d log N) on-line algorithm

By applying the type of probabilistic analysis used in Section 1.2, it is fairly straightforward
to schedule any set of N packets in O(c + d log N) steps with queues of size O(log N). We
simply delay the start of each packet by a random amount that is chosen uniformly from
[1, c/log N], and then route all the packets forward in a synchronized fashion. More precisely, we
introduce the initial delays and then consider the unconstrained schedule without regard for
the rule that at most one packet traverse any edge in a single step. With high probability, no
more than O(log N) packets will want to traverse any edge at any step of the unconstrained
schedule. Hence we can simulate each step of the unconstrained schedule with O(log N) steps
of a legitimate schedule. The final schedule consumes O((d + c/log N) log N) = O(c + d log N)
steps to complete the routing and uses queues of size O(log N).
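The random-delay scheme above can be sketched in a few lines. This is an illustration only, under several assumptions not stated in the text: each path is a list of hashable edge identifiers, no packet uses an edge twice, the logarithm is taken base 2 and rounded up, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def unconstrained_loads(paths, max_delay, rng):
    """Give each packet a uniform delay in [1, max_delay] and count, for
    every (edge, step) pair, how many packets cross that edge at that step
    in the unconstrained schedule."""
    load = Counter()
    for path in paths:
        delay = rng.randint(1, max_delay)
        for j, edge in enumerate(path):
            load[(edge, delay + j)] += 1
    return load

def schedule_length(paths, rng=None):
    """Length of the legitimate schedule obtained by expanding every
    unconstrained step into as many steps as the worst (edge, step) load."""
    rng = rng or random.Random(0)
    n = len(paths)
    d = max(len(p) for p in paths)                          # dilation
    c = max(Counter(e for p in paths for e in p).values())  # congestion
    max_delay = max(1, c // max(1, math.ceil(math.log2(n))))
    slowdown = max(unconstrained_loads(paths, max_delay, rng).values())
    return (d + max_delay) * slowdown
```

The returned value is (d + c/log N) times the observed per-step load, which mirrors the O((d + c/log N) log N) accounting in the text; the sketch measures the load rather than proving that it is O(log N).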

1.3.2 An O(c + d + log N) on-line algorithm for layered networks

In this section we show how to route N packets whose paths have congestion c on a bounded-degree
layered network with levels 0 through d in O(c + d + log N) steps with high probability
using constant-size queues. A packet can originate at any node in the network, but its
destination must be on a level with a larger number. No bound is placed on the size of the initial
and final queues. The edge queues, however, can each hold at most q packets. The value of q
can be any constant integer (including 1), and will affect the overall routing time by a constant
factor. Each node has in-degree and out-degree at most Δ, where Δ is a fixed constant.

The scheduling algorithm is identical to Ranade's algorithm except that instead of ordering
the packets based on destination address, we order them according to random ranks. In
particular, each packet is assigned a rank chosen randomly, independently, and uniformly
from the range [1, w], where w will be specified later. A packet is routed through a node only
after all the other packets with lower ranks that must pass through the node have done so. Ties
in rank are broken according to destination address.

The routing protocol guarantees that the packets in each queue are arranged from head
to tail in order of increasing rank. Before routing begins, the packets in each initial queue
are sorted according to rank. At the tail of each initial queue there is a special end-of-stream
(EOS) packet with the largest possible rank. All queues operate in a first-in first-out (FIFO)
manner. At each step, a node examines the heads of its initial and input edge queues. If any of
these queues are empty, then the node does nothing. Otherwise, it selects the packet with the
smallest rank as a candidate to be transmitted. The candidate is sent forward only if the edge
queue that it must enter contains fewer than q packets at the beginning of the step. Thus, an
edge queue is guaranteed never to hold more than q packets.

To prevent queues from becoming empty, whenever a node transmits a packet along one
output edge, it sends a ghost packet with the same rank along all of its other output edges.
The rank of the ghost packet provides the node on the next level with a lower bound on the
ranks of the packets that it will receive in the future. Ghost packets allow a node to transmit
a packet without having to wait for actual packets (if any) of higher rank to arrive on all of its
input edges. Thus, a node starts transmitting packets as soon as it has received some kind of
packet on each of its input edges, and at each step thereafter, it transmits a packet on all of its
output edges until it sends an EOS packet. For simplicity we will assume that the queue size is
at least two, so that once a queue contains a packet, it does not become empty until the node
transmits an EOS packet. With minor modifications, the analysis can be made to work with
queues of size one.

A  ghost  never  remains  at  a  node  for  more  than  one  step  and  never  resides  in  a  queue  except 
at  the  head.  At  the  end  of  each  step,  a  node  first  destroys  any  ghosts  that  were  present  in  its 
edge  queues  at  the  beginning  of  the  step,  then  destroys  any  ghosts  not  at  the  head  of  a  queue. 
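The candidate-selection and queue-capacity rules can be illustrated with a short sketch. It is deliberately partial and rests on assumptions: ghost and EOS packets are not modelled, the Packet record and the function names are hypothetical, and out_index stands in for whatever rule maps a packet to the output edge on its path.

```python
from collections import deque, namedtuple

Packet = namedtuple("Packet", ["rank", "dest"])

def select_candidate(in_queues):
    """Return the input queue whose head has the smallest (rank, dest),
    or None if any input queue is empty (the node then does nothing)."""
    if any(not queue for queue in in_queues):
        return None
    return min(in_queues, key=lambda queue: (queue[0].rank, queue[0].dest))

def try_send(in_queues, out_queues, out_index, q):
    """Attempt one transmission step at a node.

    out_index(packet) gives the position in out_queues of the edge queue
    that the packet must enter. Returns the packet sent, or None if the
    node had to wait."""
    source = select_candidate(in_queues)
    if source is None:
        return None
    target = out_queues[out_index(source[0])]
    if len(target) >= q:        # target queue already holds q packets
        return None
    packet = source.popleft()   # FIFO: take from the head ...
    target.append(packet)       # ... and append at the tail
    return packet
```

Breaking ties on dest follows the rule that ties in rank are broken according to destination address.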

To prove that the algorithm completes the routing in O(c + d + log N) steps, we use the
same delay path argument as Ranade [81] (which, in turn, is quite similar to the ones used by
Aleliunas [3] and Upfal [94]), but we simplify the counting part of the analysis. The simplified
counting has the additional nice feature that it allows the edge queue size to be as small as one,
which was not possible with Ranade's original analysis.

A delay sequence has four components. The first is a path of length l that begins on level i
at the destination of some packet. The path may traverse edges in either the forward direction
(i.e., from a level i to a level i + 1) or in the backward direction. If f is the number of forward
edges traversed on the path, then l ≤ d + 2f. The second component is a sequence s_1, ..., s_w of
not necessarily distinct nodes on the path. The third component is a sequence p_1, ..., p_w of w
distinct packets such that the path for packet p_i passes through node s_i. The final component
is a sequence r_1, ..., r_w of ranks such that r_i ≤ r_{i+1} for 1 ≤ i < w.

Each delay sequence corresponds to a bad event in a probability space. The only use of
randomness in the algorithm is in the choice of ranks for the packets. Thus, the probability
space consists of w^N equally likely elementary outcomes, one for each possible setting of the
ranks. A delay sequence corresponds to the event that the rank chosen for packet p_i is r_i, for
1 ≤ i ≤ w. Each bad event consists of w^{N−w} elementary outcomes and occurs with probability
1/w^w.

The following lemma is the crux of Ranade's argument.

Lemma 9 (Ranade) For any w, if some packet is not delivered by step d + w then a bad event
corresponding to a delay sequence with qf ≤ w occurs.

Corollary 10 If no bad event occurs, then all of the packets are delivered within d + w steps.

The theorem below presents our simplified counting argument.

Theorem 11 For any k_1, there is a k_2 such that the probability that any packet is not delivered
by step d + w, where w = k_2(d + c + log N), is at most 1/N^{k_1}.

Proof: To bound the probability that some packet is delayed w steps, we need only bound the
probability that some bad event occurs. This probability is at most

    \frac{N \, (2\Delta)^{l} \binom{l + w}{w} (\Delta c)^{w} \binom{2w}{w}}{w^{w}}.

The numerator is an upper bound on the number of different delay sequences, each corresponding
to a bad event. There are at most N places that the path can start, at most (2Δ)^l ways
that it can continue, at most \binom{l + w}{w} ways of selecting the nodes s_1, ..., s_w on the path, at most
(Δc)^w ways to pick the packets p_1, ..., p_w that pass through s_1, ..., s_w, and at most \binom{2w}{w} ways
to choose the ranks r_1, ..., r_w. Since the ranks are chosen from [1, w], the probability that
any particular bad event occurs is 1/w^w. Using the inequality l ≤ d + 2f ≤ d + 2w/q, we see that for
w = Ω(d + c + log N), this probability can be made arbitrarily small, even if q = 2. □

For simplicity, we have heretofore ignored the possibility of combining multiple packets
with the same destination. In many routing applications, there is a simple rule that allows
two packets with the same destination to be combined to form a single packet, should they
meet at a node. For example, one of the packets may be discarded, or the data carried by the
two packets may be added together. Combining is used in the emulation of concurrent-read
concurrent-write parallel random-access machines [81] and distributed random-access machines.

If the congestion is to remain a lower bound when combining is allowed then its definition
must be modified slightly. The congestion of an edge is the number of different destinations for
which at least one packet's path uses the edge. Thus, several packets with the same destination
contribute at most one to the congestion of an edge.

If packets with the same destination are to be efficiently combined by the algorithm, then
they must be given the same rank. For this purpose, a random hash function is used to
generate ranks based on destination. Since ties in rank are broken according to destination, a
node will not send a packet in one of its input queues unless it is sure that no other packet for
the same destination will arrive later in the other queue. Thus, at most one packet for each
destination traverses an edge.

For the counting argument to work, the ranks assigned by the hash function to any set of
w packets must be independent. The universal hash function [17]

    \mathrm{rank}(x) = \Bigl( \bigl( \sum_{i=0}^{w-1} a_i x^{i} \bigr) \bmod P \Bigr) \bmod w

maps a destination x ∈ [0, P − 1] to a rank in [0, w − 1] with w-way independence. Here P is
a prime number and the coefficients a_i ∈ Z_P are chosen at random. The random coefficients
use  O(tnlogP)  random  bits.  In  fact,  it  suffices  to  choose  ranks  in  the  range  [(^logA  -  1]  such 
that  any  set  of  log  N  are  independent  (63,  82].  In  most  applications,  the  number  of  possible 
different destinations is at most polynomial in N, so the hash function requires only O(log^2 N)
bits of randomness.
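A polynomial hash of this kind is short to write down. The sketch below assumes the polynomial has w coefficients a_0, ..., a_{w−1}, which is the usual construction for w-way independence and is consistent with the O(w log P) bits of randomness mentioned above; the function name make_rank_hash and the optional rng argument are hypothetical.

```python
import random

def make_rank_hash(w, P, rng=None):
    """Return rank(x) = ((a_0 + a_1*x + ... + a_{w-1}*x^(w-1)) mod P) mod w
    with coefficients a_i drawn uniformly at random from Z_P."""
    rng = rng or random.Random()
    coeffs = [rng.randrange(P) for _ in range(w)]   # a_0, ..., a_{w-1}

    def rank(x):
        acc = 0
        for a in reversed(coeffs):                  # Horner's rule mod P
            acc = (acc * x + a) % P
        return acc % w

    return rank
```

Because the rank depends only on the destination x, all packets bound for the same destination receive the same rank, which is what combining requires.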


1.3.3  Applications 

In Sections 1.4 through 1.9 we examine the many applications of the O(c + d + log N)-step
scheduling algorithm for layered networks. These applications include routing algorithms for
meshes, butterflies, multidimensional arrays and hypercubes, the shuffle-exchange graph, and
fat-trees. Section 1.4 presents the simplest application: routing N packets in O(√N) steps on
a √N × √N mesh. Another simple application, described in Section 1.5, is an algorithm for
routing N packets in O(log N) steps on an N-node butterfly. The mesh and butterfly results
were previously known [82, 81], but are included for completeness. Next, Section 1.6 presents
an algorithm for routing kN packets on an N-node k-dimensional array with maximum side
length M in O(kM) steps.

It is not obvious that the scheduling algorithm can be applied to the shuffle-exchange graph
because it is not layered. Nevertheless, in Section 1.7 we show how to route N packets in
O(log N) steps on an N-node shuffle-exchange graph by identifying a layered structure in a
large portion of the graph. In Section 1.8, we show how to adapt the scheduling algorithm
to route a set of messages with load factor λ in O(λ + log M) steps on a fat-tree [56] with root
capacity M. The fat-tree routing algorithm leads to the construction of an N-node network
with area O(N) that can simulate any other network of area O(N) with slowdown O(log N).
Finally, in Section 1.9 the scheduling algorithm is used as a subroutine in an O(log N)-step
sorting algorithm for the butterfly.


1.4  Routing  on  meshes 

A 5 × 5 mesh is illustrated in Figure 1-10. Each node has a distinct label (x, y), where x is its
column and y is its row. In an n × n mesh, 0 ≤ x, y ≤ n − 1. Thus, an n × n mesh has N = n^2
nodes. For x < n − 1, node (x, y) is connected to (x + 1, y), and for y < n − 1, node (x, y)
is connected to (x, y + 1). Sometimes wraparound edges are included, so that a node labeled
(x, n − 1) is connected to (x, 0) and a node labeled (n − 1, y) is connected to (0, y).

It is straightforward to apply the algorithm described in Section 1.3 to route N packets on
a √N × √N mesh in O(√N) steps. The algorithm consists of four phases. In the first phase
only those packets that need to route up and to the right are sent. The paths of the packets
are selected greedily with each packet first traveling to the correct row, and then to the correct
column. The level of a packet is the sum of its row and column numbers. This simple strategy
guarantees that both the congestion and dilation of the phase are O(√N). The up-right phase
is followed by up-left, down-right, and down-left phases. This algorithm was first discovered
by Ranade [82]. Although O(√N)-step routing algorithms for the mesh were known before
[41, 43, 97], they all have more complicated path selection strategies.
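The greedy path selection for the up-right phase can be sketched as follows. The sketch assumes that "up" means increasing y and "right" means increasing x, matching the edge directions given for the mesh; the function names are hypothetical.

```python
def up_right_path(src, dst):
    """Greedy path for the up-right phase: move to the correct row (y)
    first, then to the correct column (x). Nodes are (x, y) pairs."""
    (x, y), (tx, ty) = src, dst
    if tx < x or ty < y:
        raise ValueError("destination must be up and to the right")
    path = [(x, y)]
    while y < ty:            # travel to the correct row
        y += 1
        path.append((x, y))
    while x < tx:            # then travel to the correct column
        x += 1
        path.append((x, y))
    return path

def level(node):
    """Level of a node: the sum of its column and row numbers."""
    return node[0] + node[1]
```

Every hop raises the level by exactly one, so within a phase the mesh behaves as a layered network with at most 2(√N − 1) + 1 levels, which gives the O(√N) dilation.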
Figure 1-10: A 5 × 5 mesh.

1.5  Routing  on  butterflies 

An 8-input butterfly network is illustrated in Figure 1-11. Each node has a distinct label (l, r),
where l is its level, and r is its row. In an n-input butterfly, the level is an integer between
0 and lg n, and the row is a lg n-bit binary number. The nodes on level 0 and lg n are called
the inputs and outputs, respectively. Thus, an n-input butterfly has N = n(lg n + 1) nodes.
For l < lg n, a node labeled (l, r) is connected to nodes (l + 1, r) and (l + 1, r^(l)), where r^(l)
denotes r with the lth bit complemented. Sometimes the input and output nodes in each row
are identified as the same node. In this case the number of nodes is N = n lg n. The butterfly
has several natural recursive decompositions. For example, removing the nodes on level 0 (or
lg n) and their incident edges leaves two n/2-input subbutterflies.
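The connection rule can be made concrete with a small generator of butterfly edges. One assumption is needed, because the text does not say how bits are numbered: here the lth bit is taken to be bit l counted from the least significant end, starting at 0. The function name is hypothetical.

```python
def butterfly_edges(n):
    """Edges of an n-input butterfly, n a power of two with n >= 2.

    A node is a pair (level, row). Node (l, r) with l < lg n is joined
    to (l + 1, r) and to (l + 1, r with bit l complemented)."""
    lg = n.bit_length() - 1
    if n < 2 or (1 << lg) != n:
        raise ValueError("n must be a power of two, at least 2")
    edges = []
    for l in range(lg):
        for r in range(n):
            edges.append(((l, r), (l + 1, r)))             # straight edge
            edges.append(((l, r), (l + 1, r ^ (1 << l))))  # cross edge
    return edges
```

For n = 8 this produces 48 edges on 8 · (3 + 1) = 32 nodes, in agreement with N = n(lg n + 1).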

An important related network called the Benes network [10] is shown in Figure 1-12. An
n-input Benes network has 2 log n + 1 levels and contains two n-input butterflies as edge-disjoint
subgraphs. The two butterflies share nodes only on level log n. The first butterfly has its inputs
on level 0 of the Benes network, and its outputs on level log n. The second is the mirror image
of the first. It has its inputs on level 2 log n, and its outputs on level log n. An n-input
butterfly can emulate an n-input Benes network with constant slowdown. Waksman [98] proved
that the inputs and outputs of a Benes network can be connected in any permutation by a set
of node-disjoint paths.

Figure 1-11: An 8-input butterfly network. Each node has a level number between 0 and 3,
and a 3-bit row number. A node on level l in row r is connected to the nodes on level l + 1 in
rows r and r^(l), where r^(l) denotes r with the lth bit complemented.

Ranade [81] showed that the scheduling algorithm for layered networks can be applied to an
N-node butterfly to route N packets in O(log N) steps using constant-size queues. Routing is
performed on a logical network consisting of 4 lg n + 1 levels. The first lg n levels of the logical
network are linear arrays. The packets originate in these arrays, one to a node. Levels lg n
through 3 lg n form a Benes network. The last lg n levels are again linear arrays. Each packet
has its destination in one of these arrays. Packets with the same destination are combined.
The butterfly simulates each step of this network in a constant number of steps. Paths for the
packets are selected using Valiant's paradigm; each packet travels to a random intermediate
destination on level 2 lg n before moving on to its final destination. This strategy ensures that
with high probability the congestion is O(log N), so that the total time is O(log N).


Figure 1-12: An 8-input Benes network consists of two back-to-back 8-input butterfly networks.


1.6  Routing  on  multidimensional  arrays 


In this section we describe a randomized algorithm for routing kN packets on an N-node
k-dimensional array in O(kM) steps using constant-size queues, where M is the maximum of the
side lengths M_1, ..., M_k. Special cases include the mesh (k = 2) and the hypercube (M = 2).
For arrays of dimension greater than two, no asymptotically optimal constant-queue-size routing
algorithms were previously known.

A k-dimensional array with side lengths M_i ≥ 2, for 1 ≤ i ≤ k, has N = M_1 ⋯ M_k nodes
and kN edges. Each node has a distinct label (w_1, ..., w_k), where 0 ≤ w_i ≤ M_i − 1, for
1 ≤ i ≤ k. A node has one outgoing and one incoming edge for each dimension; for 1 ≤ i ≤ k,
(w_1, ..., w_k) has an edge to (w_1, ..., w_i + 1 mod M_i, ..., w_k). We assume that at each step,
a node may simultaneously transmit a packet on each of its k outgoing edges, and receive a
packet on each of its k incoming edges.

In order to apply the scheduling algorithm from Section 1.3, routing is performed on a
bounded-degree layered logical network that the array emulates. The logical network consists of
2k + 1 plateaus labeled 0 through 2k, each consisting of N logical nodes. Each node in a plateau
has a label (w_1, ..., w_k) distinct from the labels of the other nodes in the plateau. We begin by
describing the edges in plateaus 0 through k. A node on plateau i has edges only in dimensions
i and i + 1. If i > 0 and w_i < M_i − 1, then the node labeled (w_1, ..., w_k) has an edge to the node
in the same plateau with label (w_1, ..., w_i + 1, ..., w_k). Also, if i < k and w_{i+1} < M_{i+1} − 1,
then the node has an edge to (w_1, ..., w_{i+1} + 1, ..., w_k). The only connections to plateau i + 1
come from nodes with w_{i+1} = M_{i+1} − 1. For i < k, (w_1, ..., w_i, M_{i+1} − 1, w_{i+2}, ..., w_k) on plateau
i is connected to (w_1, ..., w_i, 0, w_{i+2}, ..., w_k) on plateau i + 1. Plateau k is connected to plateau
k + 1 by dimension 1 edges. Plateaus k + 1 through 2k are essentially a copy of plateaus 1
through k. The edges on plateau k + i, 1 ≤ i ≤ k, are given by the same rules as the edges on
plateau i. The level of a node in plateau i, 0 ≤ i ≤ k, is Σ_{j=1}^{k} w_j + Σ_{j=1}^{i} M_j. For
k < i ≤ 2k, the level is Σ_{j=1}^{k} w_j + Σ_{j=1}^{k} M_j + Σ_{j=1}^{i−k} M_j. The network is layered because each
edge connects a pair of nodes on adjacent levels.
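The layered property can be checked mechanically on small arrays. The sketch below covers plateaus 0 through k only and assumes that the level of a node on plateau i is the sum of its coordinates plus M_1 + ... + M_i; this level function is chosen so that every edge described above raises the level by one, and should be read as an assumption rather than a quotation. Tuple indices are 0-based, so dimension i corresponds to index i − 1. The function names are hypothetical.

```python
from itertools import product

def level(plateau, label, sides):
    """Assumed level: coordinate sum plus side lengths M_1..M_plateau."""
    return sum(label) + sum(sides[:plateau])

def first_half_edges(sides):
    """Logical edges inside and between plateaus 0..k, as pairs of
    (plateau, label) nodes, for side lengths sides = (M_1, ..., M_k)."""
    k = len(sides)
    edges = []
    for i in range(k + 1):
        for w in product(*(range(m) for m in sides)):
            if i > 0 and w[i - 1] < sides[i - 1] - 1:   # dimension i
                v = w[:i - 1] + (w[i - 1] + 1,) + w[i:]
                edges.append(((i, w), (i, v)))
            if i < k and w[i] < sides[i] - 1:           # dimension i + 1
                v = w[:i] + (w[i] + 1,) + w[i + 1:]
                edges.append(((i, w), (i, v)))
            if i < k and w[i] == sides[i] - 1:          # cross to plateau i + 1
                v = w[:i] + (0,) + w[i + 1:]
                edges.append(((i, w), (i + 1, v)))
    return edges
```

A crossing edge resets coordinate i + 1 from M_{i+1} − 1 to 0 while the plateau offset grows by M_{i+1}, for a net change of one level.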

Each step of the logical network can be emulated by the array in a constant number of steps.
The array node labeled (w_1, ..., w_k) emulates all of the logical nodes with the same label, one
for each of the 2k + 1 plateaus. The array edge from (w_1, ..., w_i, ..., w_k) to (w_1, ..., w_i +
1 mod M_i, ..., w_k) emulates at most four logical edges, one each on plateaus i − 1, i, k + i − 1,
and k + i.

Paths for the packets are selected using Valiant's paradigm. Initially each node on plateau
0 holds k packets in an initial queue. A packet travels from its origin on plateau 0 to a random
destination on plateau k, then continues on to its true destination on plateau 2k. Suppose
that a packet originating at (x_1, ..., x_k) on plateau 0 is to pass through (r_1, ..., r_k) on plateau
k on its way to (y_1, ..., y_k) on plateau 2k. In the first half of the path, plateau i is used
to make the ith component of the packet's location match the ith component of its random
destination. The packet enters plateau i ≥ 1 at node (r_1, ..., r_{i-1}, 0, x_{i+1}, ..., x_k) and traverses
dimension i edges to (r_1, ..., r_i, x_{i+1}, ..., x_k). The packet then traverses dimension i + 1 edges
to (r_1, ..., r_i, M_{i+1} - 1, x_{i+2}, ..., x_k) and crosses over to node (r_1, ..., r_i, 0, x_{i+2}, ..., x_k) on
plateau i + 1. In the second half of the path, plateau k + i is used to make the ith component
of the packet's location match the ith component of the true destination in a similar fashion.
The following lemma shows that with high probability, the congestion of the paths is at most
O(kM).
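
The first half of a packet's path can be traced with a short Python sketch. This is only an illustration under stated assumptions: dimensions and plateaus are 0-indexed in the code, the function name first_half_path is hypothetical, and the behaviour on plateau 0 (climb dimension 1 to its last coordinate, then cross over) is assumed to follow the same rule the text gives for plateaus i ≥ 1.

```python
def first_half_path(x, r, M):
    """Trace a packet from origin x on plateau 0 to intermediate node r on plateau k.

    x, r: tuples of k coordinates; M: tuple of the k side lengths.
    Returns the list of (plateau, node) pairs visited, in order.
    Assumption: plateau 0 behaves like the other plateaus described in the text.
    """
    k = len(M)
    node = list(x)
    path = [(0, tuple(node))]
    for i in range(k):  # code dimension i is the text's dimension i + 1
        # On plateau i, climb this dimension up to its last coordinate.
        while node[i] < M[i] - 1:
            node[i] += 1
            path.append((i, tuple(node)))
        # Cross over to plateau i + 1; the coordinate wraps around to 0.
        node[i] = 0
        path.append((i + 1, tuple(node)))
        # On plateau i + 1, climb until the coordinate matches r.
        while node[i] < r[i]:
            node[i] += 1
            path.append((i + 1, tuple(node)))
    return path
```

In a full simulation r would be drawn uniformly at random for each packet, and the second half of the path would repeat the same rule on plateaus k through 2k with the true destination in place of r.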


Lemma 12 For any k_1, there is a k_2 such that the probability that c > k_2 kM is at most 1/N^{k_1}.

Proof: For simplicity we analyze congestion in the first half of the network only. The calculation
for the second half is identical.


We begin by bounding the probability that a particular edge is congested. There are two
parts to the calculation: counting the number of packets that can possibly use the edge, and
bounding the probability that an individual packet actually does so. First, we count packets
that can use the edge. Consider an edge on plateau i from (w_1, ..., w_i, ..., w_k) to (w_1, ..., w_i +
1 mod M_i, ..., w_k). Since a packet does not use any dimension i + 1 through k edges before
it uses a dimension i edge, any packet that uses the edge must come from an origin whose
last k - i components x_{i+1} through x_k match w_{i+1} through w_k. There are at most M_1 ··· M_i
such origins, each transmitting k packets. Next we bound the probability that each of these
packets actually uses the edge. A packet uses the edge only if components r_1 through r_{i-1} of
its random destination match w_1 through w_{i-1}. The probability that these components match
is 1/(M_1 ··· M_{i-1}).
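
As a quick sanity check on the threshold used in the lemma, the two counts can be multiplied to bound the expected load on the edge. Writing S for the number of packets that use the edge, and using only the quantities defined above:

```latex
\[
  \mathrm{E}[S] \;\le\;
  \underbrace{k\,M_1 M_2 \cdots M_i}_{\text{packets that can use the edge}}
  \cdot
  \underbrace{\frac{1}{M_1 M_2 \cdots M_{i-1}}}_{\text{probability each one does}}
  \;=\; k\,M_i \;\le\; kM .
\]
```

So the lemma's bound of k_2 kM is a constant factor above the mean, which is the regime in which binomial tail bounds decay exponentially.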

Since the random destinations are chosen independently, the number of packets, S, that pass
through the edge has a binomial distribution. The probability that more than k_2 kM packets
use an edge is at most

\[ \Pr[S > k_2 kM] \le \binom{k M_1 \cdots M_i}{k_2 kM} \left( \frac{1}{M_1 \cdots M_{i-1}} \right)^{k_2 kM}. \]

Using the inequalities M_i ≤ M for 1 ≤ i ≤ k, and $\binom{a}{b} \le (ae/b)^b$, we have

\[ \Pr[S > k_2 kM] \le \left( \frac{e}{k_2} \right)^{k_2 kM}. \]

To bound the probability that any edge is congested, we simply sum the probabilities that
each particular edge is congested. Since the logical network has O(kN) edges, the sum is at most

\[ O(kN) \left( \frac{e}{k_2} \right)^{k_2 kM}. \]

For any k_1, there is a k_2 such that this probability is at most 1/N^{k_1}.


□ 


Theorem 13 For any k_1, there is a k_2 such that the probability that any packet is not delivered
by step k_2 kM is at most 1/N^{k_1}.


Proof: With high probability, the scheduling algorithm from Section 1.3 delivers all packets in
O(c + d + log N) steps. The number of levels is O(kM), and by Lemma 12 with high probability
the congestion is O(kM). Also, log N ≤ kM. □

Figure 1-13: An 8-node shuffle-exchange graph. Shuffle edges are solid, exchange edges dashed.

1.7  Routing  on  shuffle-exchange  graphs 

In this section, we present a randomized algorithm for routing any permutation of N packets on
an N-node shuffle-exchange graph in O(log N) steps using constant-size queues. The previous
O(log N)-time algorithms [3] required queues of size Ω(log N).

Figure 1-13 shows an 8-node shuffle-exchange graph. Each node is labeled with a unique
lg N-bit binary string. A node labeled a = a_{lg N-1} ··· a_0 is linked to a node labeled b =
b_{lg N-1} ··· b_0 by a shuffle edge if rotating a one position to the left or right yields b, i.e., if
either b = a_0 a_{lg N-1} a_{lg N-2} ··· a_1 or b = a_{lg N-2} a_{lg N-3} ··· a_0 a_{lg N-1}. Two nodes labeled a and
b are linked by an exchange edge if a and b differ in only the least significant (rightmost) bit,
i.e., b = a_{lg N-1} ··· a_1 ā_0. In the figure, the shuffle edges are solid, and the exchange edges are
dashed.

The  removal  of  the  exchange  edges  partitions  the  graph  into  a  set  of  connected  components 
called  necklaces.  Each  necklace  is  a  ring  of  nodes  connected  by  shuffle  edges.  If  two  nodes  lie 
on the same necklace, then their labels are rotations of each other. Due to cyclic symmetry,
the number of nodes in the necklaces differs. For example, in a 64-node shuffle-exchange graph,
the nodes 010101 and 101010 form a 2-node necklace, while 011011, 110110, and 101101 form
a  3-node  necklace.  For  each  necklace,  the  node  with  the  lexicographically  minimum  label  is 
chosen  to  be  the  necklace’s  representative. 
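
The definitions of shuffle edges, exchange edges, necklaces, and representatives translate directly into string operations on labels. The following Python sketch is illustrative only; the function names are hypothetical and labels are represented as strings of '0' and '1' with the least significant bit on the right.

```python
def shuffle_neighbors(a: str) -> set[str]:
    """Labels reachable from a by one shuffle edge (rotate left or right by one)."""
    return {a[1:] + a[0], a[-1] + a[:-1]}


def exchange_neighbor(a: str) -> str:
    """Label reachable by the exchange edge: complement the rightmost bit."""
    return a[:-1] + ("1" if a[-1] == "0" else "0")


def necklace(a: str) -> list[str]:
    """All distinct rotations of a, i.e. the nodes on a's necklace, sorted."""
    return sorted({a[i:] + a[:i] for i in range(len(a))})


def representative(a: str) -> str:
    """Lexicographically minimum label on a's necklace."""
    return necklace(a)[0]
```

For the 64-node examples in the text, necklace("010101") has two members and necklace("011011") has three, while a label without cyclic symmetry has lg N distinct rotations.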

1.7.1 Good and bad nodes

Unlike the mesh and butterfly networks, the shuffle-exchange graph cannot emulate a layered
network in a transparent fashion. Nevertheless, it is still possible to apply the O(c + d + log N)
scheduling algorithm for layered networks to the problem of routing on the shuffle-exchange
graph. The key idea is that a large subset of the shuffle-exchange graph (at least N/5 nodes)
can emulate a layered network. We call these nodes good nodes. The rest of the nodes are bad.

A node can be classified as bad for one of three reasons: (1) its label does not contain a
substring of lglg N consecutive 0's (we consider the rightmost and leftmost bits in a label to
be consecutive), (2) its label contains at least two disjoint longest substrings of at least lglg N
consecutive 0's, or (3) its label is 0···0. Thus, the label of every good node contains a unique
longest substring of 0's with length at least lglg N. For simplicity, we assume that lglg N is
integral, and that lg N > lglg N.

Since  the  length  of  a  substring  of  consecutive  0’s  in  a  label  is  not  changed  by  rotation,  a 
necklace  consists  either  entirely  of  good  nodes  or  entirely  of  bad  nodes.  Furthermore,  each 
good  necklace  consists  of  lg  N  good  nodes  since  a  unique  longest  substring  of  consecutive  0’s 
precludes  cyclic  symmetry. 
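
The three-way classification of bad nodes can be checked mechanically. The sketch below is a minimal illustration, not the authors' procedure: the names cyclic_zero_runs and classify are hypothetical, labels are bit strings, and the parameter t stands for lglg N (for 16-bit labels, t = 4).

```python
def cyclic_zero_runs(label: str) -> list[int]:
    """Lengths of the maximal runs of 0's, treating the label as cyclic."""
    if "1" not in label:
        return [len(label)]
    # Rotate so the label starts with a 1; then no run of 0's wraps around.
    start = label.index("1")
    rotated = label[start:] + label[:start]
    return [len(run) for run in rotated.split("1") if run]


def classify(label: str, t: int) -> str:
    """Return 'good', or 'bad-1' / 'bad-2' / 'bad-3' following the three reasons."""
    if "1" not in label:
        return "bad-3"                # the all-zero label
    runs = cyclic_zero_runs(label)
    longest = max(runs, default=0)
    if longest < t:
        return "bad-1"                # no run of at least t consecutive 0's
    if runs.count(longest) >= 2:
        return "bad-2"                # the longest run is not unique
    return "good"
```

Because the run lengths are computed cyclically, every rotation of a label receives the same classification, which matches the observation that a necklace is entirely good or entirely bad.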

In order to route packets between all N nodes of the shuffle-exchange graph, we associate
the bad nodes with good nodes. A type-1 bad node is associated with a good node by changing
the least significant bit of its label to a 1 and the lglg N most significant bits to 0's. Each bad
necklace of type 2 is associated with a good necklace by changing the two bits following the
leading group of 0's in its representative's label to 01. Finally, the node 0···0 is associated
with its neighbor 0···01.

Lemma 14 At most 4 lg N bad nodes are associated with any good necklace.

Proof: Each type-1 bad node is associated with the representative of a good necklace since,
after the transformation, the longest string of consecutive 0's begins with the most significant
bit. Only bad nodes whose labels differ from the representative's label in at most lglg N + 1
bits are associated with it, so at most 2^{lglg N + 1} = 2 lg N type-1 bad nodes are associated with
any good necklace.

To assess the number of type-2 bad nodes associated with a good necklace, we consider
the label of the representative of the good necklace and notice that only a bad necklace whose
representative's label differs in the last bit of its leading block of 0's and possibly the bit after
that can be mapped to the good necklace. Thus, at most two type-2 bad necklaces are associated
with any good necklace.

Finally, no bad nodes of either type 1 or 2 are associated with the necklace of node 0···01.

□ 

Corollary 15 At least N/5 of the nodes are good.

Proof: By Lemma 14 at most 4 lg N bad nodes are associated with any good necklace. Since
every good necklace contains exactly lg N nodes, at least N/5 of the nodes are good. □
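
The counting step behind the corollary can be spelled out. In the sketch below, G and B are symbols introduced here for the numbers of good and bad nodes; they do not appear in the original text.

```latex
\[
  \#\{\text{good necklaces}\} = \frac{G}{\lg N},
  \qquad
  B \;\le\; 4\lg N \cdot \frac{G}{\lg N} \;=\; 4G,
  \qquad
  N = G + B \;\le\; 5G
  \;\;\Longrightarrow\;\;
  G \;\ge\; \frac{N}{5}.
\]
```

The middle inequality uses the fact that every bad node is associated with some good necklace, so summing the per-necklace bound over all good necklaces covers all bad nodes.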

The remainder of this section provides the details of the routing algorithm. We begin by
describing a logical layered network that the good nodes can easily emulate with constant
overhead. Next, we show that, for any routing problem, choosing random intermediate destinations
yields paths with congestion and dilation O(log N) in this network, with high probability. Thus,
by applying the analysis of Section 1.3, routing on the logical network takes O(log N) steps, with
high probability, and uses constant-sized queues. We conclude by describing the deterministic
routing between good and bad nodes.

1.7.2  A  layered  network 

The level of a node is determined by the distance to the representative node in its necklace. An
alternate way to write a node's label is to place a line under its least significant bit (which we
call the current bit), and then rotate it until it matches its representative's label. For example,
110001 can also be written 000111 with the fourth bit from the left underlined. The level of a
node is the position of the current bit, counting from the left starting at 0. For example, 000111
with its fourth bit underlined lies on level 3. (Note that the representative node
lies on level lg N - 1.)
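
The leveling rule can be computed directly from a label. The following Python sketch assumes labels are bit strings whose rightmost character is the current bit and that positions are counted from 0 on the left; the function name level is hypothetical. For labels with cyclic symmetry the rotation is not unique, and the sketch simply takes the smallest one.

```python
def level(label: str) -> int:
    """Position of the current (rightmost) bit after rotating label to its representative."""
    n = len(label)
    rep = min(label[i:] + label[:i] for i in range(n))
    # Smallest left rotation that turns the label into its representative.
    shift = next(s for s in range(n) if label[s:] + label[:s] == rep)
    # The bit that sat at index n - 1 moves to index n - 1 - shift (mod n).
    return (n - 1 - shift) % n
```

With this convention, level("110001") is 3 as in the example above, the representative 000111 sits on level 5 = lg N - 1, and a left rotation of a label raises its level by one except at the wrap from level lg N - 1 to level 0.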

The problem with this leveling scheme is that although it induces a leveling of the shuffle
edges, it does not necessarily induce a leveling of the exchange edges. An exchange edge may
create  a  new  longest  substring  of  0’s  by  appending  two  substrings  separated  by  a  single  1,  and 
thus  connect  two  levels  which  are  very  far  apart. 
To overcome this difficulty, we replace the exchange edges with flip edges. A flip edge links
nodes  labeled  a  and  b  if  both  are  good,  a  =  «igiv-i  5  =  *«d  aj 

is not in the longest block of 0's of a. Note that a flip edge extends a group of 0's by at most
one. Thus no flip edge can create a new leading group of 0's, because if it grew a shorter group
to  be  as  big  as  the  leading  group,  then  it  would  lead  to  a  bad  node  of  type  2,  a  contradiction 
since  flip  edges  occur  only  between  good  nodes  by  definition.  Thus  flip  edges  are  leveled.  The 
operation  of  the  flip  edges  can  be  emulated  by  the  shuffle-exchange  graph  with  only  a  constant 
factor  of  slowdown;  each  flip  edge  is  composed  of  an  exchange  edge,  a  shuffle  edge,  and  possibly 
another  exchange  edge. 

We denote by A the network composed of the good nodes, the shuffle edges (excluding the
shuffle edges from level lg N - 1 to 0), and the flip edges. Note that in network A, from any
level 0 node we can reach any necklace with a longest string of 0's having the same or greater
length by correcting bits starting from the end of the leading block of 0's.

In fact, we wish to be able to get from the level 0 node of a necklace to any other necklace.
Thus we append to A its mirror image A^r so that we can reach necklaces with fewer 0's. The
leveling is extended in the natural manner. We call the resulting network AA^r, and note
that network A can easily emulate it.

We denote by L the network consisting of the shuffle edges on the good nodes, again excluding
shuffle edges from level lg N - 1 to level 0. Our method of path selection consists of routing
from a good node to its level 0 node, then routing to a random intermediate necklace, then
routing to the destination necklace, and finally routing to the appropriate good node. Thus,
we route in a layered network composed of network L, network AA^r, another network AA^r,
followed by network L. We extend the leveling in the natural manner and note that network A
can easily emulate the whole thing.

1.7.3  Path  selection  and  congestion 

For each packet we choose its path by uniformly choosing a random good necklace to route
through before going to its final destination. So the path for a packet consists of a path
through L to node 0 of its necklace, the path through AA^r to its random intermediate necklace,
the path through the second AA^r to its destination necklace, and a path through the second L
to the proper node of the necklace.

The following lemma shows that if at most O(log N) packets originate and terminate in each
good necklace, then this method yields paths with congestion O(log N) with high probability.

Lemma 16 Suppose that each good necklace sends and receives at most b lg N packets, where
b is a fixed constant. Then for any constant k_1 there is a constant k_2 such that the probability
that more than k_2 lg N packets use any edge is at most 1/N^{k_1}.

Proof: We observe that for the paths in the copies of L, we have congestion b lg N, since at
most b lg N packets start or end in any good necklace. By symmetry we claim that the analysis
of the path portions in both copies of AA^r is the same. Finally we recall that in AA^r, we route
packets going to necklaces with the same or more 0's to the appropriate necklace in network A
and straight across network A^r, and we route the other packets straight across in network A
and use A^r to route to the proper necklace. We will show that any destination necklace gets
O(log N) packets with high probability, so the straight-across portion of the paths should not
be a problem. To finish, we give the analysis of the congestion due to packets in just network
A, and claim that the arguments will hold by symmetry for A^r.

Consider an edge e in the first copy of network A. In this half, packets going to necklaces with
fewer leading 0's are routed straight across A. There are at most b lg N of these, so without
loss of generality we ignore them. Suppose that e traverses levels m and m + 1. Let x be
the number of 0's in the necklace to which e goes. If m < x, then no packet from any other
necklace uses e, since we only map to a necklace via flip edges after its longest string of 0's.
Otherwise, we consider the number of packets from other necklaces that can use e. We know
that only packets from at most 2^f other necklaces, with f = m - lglg N, could have used e since
at most f bits could have changed by level m + 1. Thus the number of packets that can use
e is at most b · 2^f lg N since each necklace starts with at most b lg N packets. The probability
that a specific packet uses e is the number of necklaces that can be reached using e (at most
2^{lg N - lglg N - f} necklaces, which match e's necklace in the first f + lglg N bits), divided by
the total number of good necklaces, at least N/(5 lg N), which is just 5/2^f.
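
The arithmetic in the last two sentences can be written out, together with the expected load it implies. The symbol S, for the number of packets that use e, is introduced here for illustration only.

```latex
\[
  \frac{2^{\lg N - \lg\lg N - f}}{N/(5\lg N)}
  \;=\; \frac{N}{2^{f}\lg N}\cdot\frac{5\lg N}{N}
  \;=\; \frac{5}{2^{f}},
  \qquad
  \mathrm{E}[S] \;\le\; b\,2^{f}\lg N\cdot\frac{5}{2^{f}} \;=\; 5b\lg N .
\]
```

The factors of 2^f cancel, so the expected load on e is O(log N) regardless of the level at which e lies.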

The probability that more than k_2 lg N packets use e is at most

\[ \binom{b\,2^{f}\lg N}{k_2 \lg N} \left( \frac{5}{2^{f}} \right)^{k_2 \lg N} \]

since there are b · 2^f lg N Bernoulli trials, each succeeding with probability 5/2^f. The probability
that any of the O(N) edges of this stage has congestion more than k_2 lg N is O(N) times this
probability. For any k_1, we can bound the product by 1/N^{k_1} by choosing k_2 large enough. □

Because the congestion and number of levels are O(log N), with high probability, the time
to route the packets between the good nodes is also O(log N), with high probability, and the
queue size is constant.

1.7.4  Packets  from  bad  nodes 

In  this  section  we  show  how  to  deterministically  route  the  packets  from  the  bad  nodes  to  their 
associated  good  necklaces. 

Lemma 17 Packets from bad nodes are routed to the associated good necklaces deterministically
in O(log N) time using constant-size queues.

Proof: Recall that we associate a bad node of type 1 with the necklace represented by a 1 in
the least significant or current bit plus lglg N 0's in the most significant bits. We route these
packets in the shuffle-exchange graph by flipping the current bit to a 1 and flipping lglg N bits
to the right to 0's. Thus we map a bad node to a good necklace at its level lglg N node.

For any necklace, we have a binary tree, the leaves of which are mapped to the necklace.
Each level of the tree corresponds to one of the lglg N + 1 bits that were flipped. Therefore,
we can route packets from the binary tree leaves to the necklace, and distribute them along
the necklace deterministically. This is easily done in O(log N) time with constant queues. The
routing from the necklace to the tree is equally trivial. But, we need to ensure that traffic from
the  separate  binary  trees  does  not  interfere  too  much.  This  is  easy  since  any  bad  node  is  in 
at  most  two  binary  trees;  in  at  most  one  as  a  leaf  since  any  node  is  mapped  to  exactly  one 
good  node,  and  in  at  most  one  as  an  internal  node  since  the  number  of  0's  between  the  current 
node  and  the  closest  1  to  the  left  determines  a  unique  level  and  the  rest  of  the  bits  determine 
a  unique  tree. 

To finish, we consider bad nodes of type 2. These are nodes without a unique longest string
of 0's. Here we extend one of the groups of 0's by one 0, making sure not to join two groups of
0's by inserting a 1, mimicking the flip operation. For any good necklace whose representative
is 0^x 1..., only the necklaces represented by 0^{x-1}10... and 0^{x-1}11... can be mapped to it. Again,
at most two bad necklaces are associated with any good necklace.

For  each  packet  in  such  a  bad  necklace  we  route  it  through  the  node  connecting  it  to  the 
appropriate  good  necklace.  We  perform  this  movement  by  pipelining  the  packets  through  the 
edge  which  connects  the  two  necklaces.  We  see  that  this  mapping  maps  at  most  one  packet 
from  the  bad  necklace  to  a  node  in  the  good  necklace.  Since  we  are  basically  routing  on  linear 
arrays of length at most 2 lg N, 2 lg N steps suffice to route the packets appropriately. Thus,
4 lg N steps are sufficient to route the packets from two bad necklaces.

This finishes the description of the maps to and from all the bad nodes except for node
0···0, which is adjacent to node 0···01. □

1.7.5  Summary 

The  main  result  of  this  section  is  summarized  in  the  following  theorem. 

Theorem 18 With high probability, an N-node shuffle-exchange graph can route any permutation
of N packets in O(log N) steps using constant-size queues.

Proof: There are three phases to the algorithm. First, packets originating at bad nodes are
deterministically routed to the good nodes with which they are associated. By Lemma 17 this
phase requires O(log N) steps. Next, packets are routed between the good nodes on the logical
network. Since at most 4 lg N bad nodes are associated with each good necklace, with high
probability the congestion of the paths on the logical network is O(log N) by Lemma 16. Thus,
this phase requires O(log N) steps, with high probability. The packets are routed in O(log N)
steps using the scheduling algorithm from Section 1.3. Finally, packets destined for bad nodes
are deterministically routed from the good nodes to bad. By an analysis similar to that of
Lemma 17, this phase also requires O(log N) steps. □


1.8  Construction  of  area  and  volume-universal  networks 

In this section we construct a class of point-to-point networks that are area-universal in the sense
that a network in the class with N processors has area O(N) and can, with high probability,
simulate in O(log N) steps each message-step of any shared-bus network of area O(N). The
simulation is optimal because a point-to-point network may require Ω(log N) steps to simulate
one step of a shared-bus network. The networks are based on the fat-trees of Greenberg and
Leiserson [29] and the simulation uses the message routing algorithm from Section 3.

In  a  fixed-connection  network,  processors  communicate  via  wires.  Each  processor  has  a 
bounded  number  of  read  and  write  pins.  In  a  point-to-point  fixed-connection  network,  each 
wire  connects  one  read  pin  with  one  write  pin.  In  each  message-step,  the  processor  with  the 
write pin may transmit a message of O(log N) bits to the processor with the read pin. In a
shared-bus  fixed-connection  network,  a  wire  may  connect  many  read  and  write  pins.  Such  a 
wire  is  called  a  bus.  In  each  message-step,  any  processors  wishing  to  send  messages  make  them 
available  on  their  write  pins.  Then  the  messages  at  the  write  pins  of  each  wire  are  combined  by 
some simple rule to form a single message. Combining is assumed to require a single message-
step, regardless of the number of messages combined or the rule used.

Leiserson was the first to display a class of fixed-connection networks that could efficiently
simulate any other network of the same area or volume. In [56] he showed that a fat-tree of area
O(N) can simulate in O(log^3 N) bit-steps each bit-step of any point-to-point fixed-connection
network of area O(N). The simulation used an off-line routing algorithm for fat-trees. On-line
routing algorithms were later developed by Greenberg and Leiserson [29] and Park [73]. None
of these routing algorithms is capable of combining messages to the same destination. As a
consequence, no scheme for simulating shared-bus networks was known until now. A network
that can simulate in O(1) steps each step of any shared-bus network of equal area was
presented in [69]. However, the connections in this network are not fixed; instead, processors
communicate via reconfigurable busses.

A  fat-tree  network  is  shown  in  Figure  1-14.  Its  underlying  structure  is  a  complete  4-ary 
tree.  Each  edge  in  the  4-ary  tree  corresponds  to  a  pair  of  oppositely  directed  groups  of  wires 
called  channels.  The  channel  directed  from  the  leaves  to  the  root  is  called  an  up  channel;  the 
other  is  called  a  down  channel.  The  capacity  of  a  channel  c,  cap(c),  is  the  number  of  wires  in 
the  channel.  We  call  the  tree  “fat”  because  the  capacities  of  the  channels  grow  by  a  factor  of  2 
at every level. A fat-tree of height m has M^2 = 2^{2m} leaves and M = 2^m vertices at the root.

Figure 1-14: A fat-tree.

It will prove useful to label the switches at the top and bottom of each channel. Let the
level of a switch be its distance from the leaves. Suppose a channel c connects cap(c)/2 = 2^l
switches at level l with cap(c) = 2^{l+1} switches at level l + 1. Give the switches at level l labels
0 through 2^l - 1 and the switches at level l + 1 labels 0 through 2^{l+1} - 1. Then switch k at level
l is connected to switches k and k + 2^l at level l + 1. The following lemma relates the labels of
the switches on a message's path from a leaf to the root.

Lemma 19 There is a unique shortest path from any leaf to a switch labeled k at the root, for
0 ≤ k ≤ M - 1, and that path passes through a switch labeled k mod 2^l at level l, for 0 ≤ l ≤ m.

□ 
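
Lemma 19 can be checked by brute force for small m. The Python sketch below enumerates every upward path from a leaf using the connection rule stated above (switch k at level l connects to switches k and k + 2^l at level l + 1). The function names are hypothetical, and the sketch considers only the chain of channels above a single leaf.

```python
def up_neighbors(k: int, level: int) -> tuple[int, int]:
    """Switches at level + 1 that switch k at the given level connects to."""
    return (k, k + 2 ** level)


def paths_to_root(m: int) -> list[list[int]]:
    """All upward paths from the single level-0 switch of a leaf to level m.

    Each path is the list of switch labels visited at levels 0, 1, ..., m.
    """
    paths = [[0]]
    for lvl in range(m):
        paths = [p + [nxt] for p in paths for nxt in up_neighbors(p[-1], lvl)]
    return paths
```

Each of the 2^m paths ends at a distinct root switch, and the label at level l is the root label reduced mod 2^l, because every step either keeps the label or sets one new binary digit.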

For a set Q of messages to be delivered between the leaves of the fat-tree, we define the load
of Q on a channel c, load(Q, c), to be the number of destinations of messages in Q for which
at least one message must pass through c. Note that even if many messages with the same
destination must pass through a channel, that destination contributes at most one to the load
of the channel. We define the load factor of Q on c, λ(Q, c), to be the ratio of the load of Q on
c to the capacity of c, λ(Q, c) = load(Q, c)/cap(c). The load factor on the entire network, λ(Q),
is simply the maximum load factor on any channel, λ(Q) = max_c λ(Q, c). The load factor is a
lower bound on the number of steps required to deliver Q. We shall assume that λ ≤ M^k,
where k is some fixed constant. We shall sometimes write λ to denote λ(Q) when the set of
messages to be delivered is clear from the context.
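
A small Python sketch may help make the load factor concrete. It is an illustration under assumptions that are not spelled out in the text: leaves are numbered 0 through 4^m - 1 so that the level-l ancestor of leaf x is x // 4^l, a channel between levels l and l + 1 has capacity 2^(l+1), and a message climbs only as far as the lowest common ancestor of its source and destination. The function name load_factor is hypothetical.

```python
from collections import defaultdict


def load_factor(messages, m):
    """Maximum over channels of (distinct destinations using the channel) / capacity.

    messages: iterable of (source_leaf, destination_leaf) pairs.
    m: height of the 4-ary fat-tree (assumed leaf numbering 0 .. 4**m - 1).
    """
    dests = defaultdict(set)  # channel -> distinct destinations passing through it
    for src, dst in messages:
        for lvl in range(m):
            s_sub, d_sub = src // 4 ** lvl, dst // 4 ** lvl
            if s_sub == d_sub:
                break  # common ancestor reached; no higher channel is needed
            dests[("up", lvl, s_sub)].add(dst)
            dests[("down", lvl, d_sub)].add(dst)
    # Assumed capacity of a channel between levels lvl and lvl + 1: 2**(lvl + 1).
    return max((len(d) / 2 ** (ch[1] + 1) for ch, d in dests.items()), default=0.0)
```

Three messages from different leaves to the same destination give a load of one on the destination's down channel, whereas three messages from one leaf to three different destinations give a load of three on that leaf's up channel.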

In a layered fat-tree a switch at the top of an up channel at level l is connected to itself
at the top of the corresponding down channel by a linear chain of switches of length 2(m - l).
A message may only make a transition from an up channel to a down channel by traversing a
chain. Thus all shortest paths between leaves in a layered fat-tree have length 2m. Note that
the load of a set of messages on a channel of the layered fat-tree is identical to the load on the
corresponding channel in the fat-tree.

The path that a message for destination x in column 2m takes through a layered fat-tree is
determined by the m-universal hash function [17]

\[ \mathrm{path}(x) = \Bigl( \Bigl( \sum_{i=0}^{m-1} a_i x^i \Bigr) \bmod P \Bigr) \bmod M, \]

where P is a prime number larger than the number of possible different destinations, and the
a_i ∈ Z_P are chosen at random off-line. A message with destination x follows up channels until
it can reach x without using any more up channels. It then crosses over to a down channel via
a chain, and follows down channels to x. Note that a message only passes through a channel
if it must. Also, all messages with destination x that pass through channel c pass through
switch (path(x) mod cap(c)) at the top of c and through switch (path(x) mod (cap(c)/2)) at
the bottom of c.
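
The role of the hash function can be illustrated in Python. The sketch below assumes the standard polynomial construction for an m-universal family (a random polynomial of degree m - 1 over Z_P, reduced mod M); the exact form used in the source is taken as an assumption here, and the names make_path_hash and path are illustrative.

```python
import random


def make_path_hash(m: int, M: int, P: int, rng: random.Random):
    """Return path(x) = ((a_0 + a_1 x + ... + a_{m-1} x^{m-1}) mod P) mod M.

    Assumption: P is a prime larger than the number of destinations, and the
    coefficients a_i are drawn uniformly from Z_P once, off-line.
    """
    coeffs = [rng.randrange(P) for _ in range(m)]

    def path(x: int) -> int:
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, reducing mod P at each step
            acc = (acc * x + a) % P
        return acc % M

    return path
```

Because channel capacities are powers of two, reducing path(x) mod cap(c) at the top of a channel and mod cap(c)/2 at the bottom is consistent with the labeling of Lemma 19: the label at the bottom is the label at the top with its highest bit dropped.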


The  following  lemma  shows  that  we  can  use  the  scheduling  algorithm  from  Section  3  to 
route  messages  in  a  fat-tree. 


Lemma 20 For any constant c_1, there is a constant c_2 such that the probability that the number
of steps required to deliver a set Q of N messages with load factor λ is more than c_2(λ + log M)
is at most 1/M^{c_1}, provided that N is polynomial in M.

Proof: The paths of the messages are first randomized using the universal hash function path.
With high probability, the resulting congestion is c = O(λ + log M). Each message travels a
distance of d = 2m = 2 log M. The messages are then scheduled using the algorithm from
Section 3. □

Let us now consider the VLSI area requirements [03] of fat-trees. A fat-tree with root
capacity M and O(M^2) processors has a layout with area O(M^2 log^2 M) that is obtained by
embedding the fat-tree in the tree of meshes [<lG]. The nodes of the tree of meshes in this layout
are separated by a distance of lg M in both the horizontal and vertical directions. Thus, the
O(log M) space for the chain associated with each processor in the layered fat-tree can be
allocated without increasing the asymptotic area of the layout. (In fact, it is possible to attach
a chain of size O(log^2 M) to each fat-tree node without increasing the area by more than a
constant factor.) The leaves of the fat-tree are separated in the layout from each other by a
distance of lg M in each direction. We can improve the density of processors without increasing
the asymptotic area of the layout by connecting a lg M × lg M mesh of processors to each leaf.
The resulting network has O(M^2 log^2 M) processors and area O(M^2 log^2 M). The N-processor
network in this class has root capacity Θ(√N / log N), Θ(N / log^2 N) leaves, and area Θ(N).
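
The parameters of the N-processor network follow from inverting the processor count. A short derivation, using only the quantities above and treating the processor count as tight:

```latex
\[
  N = \Theta(M^2 \log^2 M)
  \;\Longrightarrow\;
  \log N = \Theta(\log M)
  \;\Longrightarrow\;
  M = \Theta\!\left(\frac{\sqrt{N}}{\log N}\right),
  \qquad
  M^2 = \Theta\!\left(\frac{N}{\log^2 N}\right).
\]
```

Here M is the root capacity and M^2 is the number of leaves, each of which carries a lg M × lg M mesh of processors.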

The  following  theorem  shows  that  this  class  of  networks  is  area-universal. 

Theorem 21 With high probability, an N-processor point-to-point fixed-connection network
U of area O(N) can simulate in O(log N) steps each step of any shared-bus fixed-connection
network B of area O(N).

Proof: The processors of the shared-bus network B are mapped to the processors of the area-
universal network U off-line using a recursive decomposition technique as in [56]. In each step,
a wire of B is simulated by routing messages between the processors that it connects. At each
level of the recursion at most O(cap(c) · log N) wires connect the processors mapped below a
channel c with the rest of the network. This property of the mapping ensures that the load
factor of each set of messages used in the simulation of B is at most O(log N). At the bottom
of the decomposition tree, an O(log N) × O(log N) region of the layout of B is mapped to each
leaf of the fat-tree. The O(log N) × O(log N) mesh connected to the leaf in U simulates this
region of B using standard mesh routing algorithms. □

The study of fat-tree routing algorithms that perform combining was motivated in part
by an abstraction of the volume and area-universal networks called the distributed random-
access machine (DRAM). A host of conservative algorithms for tree and graph problems for the
exclusive-read exclusive-write (EREW) DRAM are presented in [58]. Recently we discovered
conservative concurrent-read concurrent-write (CRCW) algorithms that require fewer steps
for some of these problems. Until now, however, no efficient fat-tree routing algorithms that
perform combining were known. The O(λ + log M) step routing algorithm presented here fills
the void.

Only slight modifications to the area-universal fat-tree are necessary to make it
volume-universal [29]. The underlying structure of the volume-universal fat-tree is a complete 8-ary
tree. Instead of doubling at each level, the channel capacities increase by a factor of 4. The
tree has m levels, root capacity $M = 2^{2m}$, and $M^{3/2} = 2^{3m}$ leaves. The switches at the top of
a channel at level $l$ are labeled 0 through $4^l - 1$. Switch $k$ at level $l$ is connected to switches $k$,
$k + 4^l$, $k + 2 \cdot 4^l$, and $k + 3 \cdot 4^l$ at level $l + 1$. A layout with volume $O(M^{3/2} \log^{3/2} M)$ for the
fat-tree can be obtained by embedding it in the three-dimensional tree of meshes. As before,
a chain of size $O(\log^{3/2} M)$ can be attached to each node of the fat-tree without increasing
the asymptotic layout area, and the density of processors can be improved by connecting a
$\lg^{1/2} M \times \lg^{1/2} M \times \lg^{1/2} M$ mesh to each leaf.
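
The wiring rule for the volume-universal fat-tree can be written down directly. The short Python sketch below is only an illustration: the function name `connected_switches` and its arguments are hypothetical, and it assumes the rule stated above, namely that switch k at level l (labels 0 through 4^l - 1) is joined to switches k + j · 4^l at level l + 1 for j = 0, 1, 2, 3.

```python
def connected_switches(k: int, level: int, factor: int = 4) -> list[int]:
    """Labels of the switches at level + 1 joined to switch k at `level`.

    Assumes switches at the top of a channel at `level` are labeled
    0 .. factor**level - 1, as in the text (factor = 4).
    """
    width = factor ** level
    if not 0 <= k < width:
        raise ValueError("switch label out of range for this level")
    return [k + j * width for j in range(factor)]
```

Under this rule every one of the 4^(l+1) labels at level l + 1 is reached from exactly one switch at level l, which matches the fourfold growth in channel capacity.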

1.9  Sorting  on  butterflies 

In this section we present a randomized algorithm for sorting N lg N packets on an N lg N-node
butterfly network in O(log N) steps using constant-size queues. The algorithm is based on the
Flashsort algorithm of Reif and Valiant [84]. The main difference is that we use the algorithm
for scheduling packets on layered networks in place of their scheduling algorithm, which requires
queues of size O(log N). A similar approach has been suggested previously by Pippenger [76]
and Reif [83].

1.9.1  The  algorithm 

The basic outline of the algorithm is the same as that of Flashsort. The first step is to randomly
select a small set of splitters from among the packets that are to be sorted. Next the splitters
are sorted deterministically. The splitters partition the packets into intervals. The ith interval
consists of those packets whose keys are larger than the key of the (i - 1)st largest splitter, and
smaller than the key of the ith largest splitter. (We assume without loss of generality that all
of the keys are distinct.) Using the splitters as guides, each interval of packets is routed to a
different subbutterfly, where it is sorted recursively.

We begin by describing a recursive algorithm for sorting $N/\lg^{\alpha} N$ packets in O(log N) time
on an N lg N-node butterfly, where α is some fixed constant larger than one. The butterfly is
"lightly loaded" by this factor of $\lg^{\alpha+1} N$ to ensure that, with high probability, at the lower
levels of the recursion the number of packets to be sorted by each subbutterfly does not exceed
the number of inputs to that subbutterfly. When the algorithm is invoked, each packet must
1. Count the number of packets entering the butterfly. Let the number of packets be denoted
by n.

2. Randomly and independently, make each packet a candidate with probability $\sqrt{M}/n$.

3. Sort the candidates deterministically.

4. Select every (lg N)th candidate to be a splitter.

5. Distribute the splitters for splitter-directed routing.

6. Route each packet to a random row of the butterfly.

7. Route each interval to a subbutterfly via splitter-directed routing.

8. Distribute the packets in each interval to distinct inputs of the corresponding
subbutterflies.

9. Sort the intervals recursively.

Figure 1-15: The steps performed by an M-input butterfly in the recursive algorithm for sorting
$N/\lg^{\alpha} N$ packets in O(log N) time on an N lg N-node butterfly using constant-size queues.

reside at a distinct input. As we shall see, this algorithm can be combined with Leighton's
Columnsort algorithm [47] to sort all N lg N packets in O(log N) time.

The steps taken by a butterfly with M inputs are presented in some detail in Figure 1-15.
The first step in the algorithm is to count the number of packets entering the butterfly.
Since the packets reside in distinct inputs, the total number of packets can be computed via
a parallel prefix computation. The prefix computation can be performed in O(log M) time
deterministically.

Next each packet independently chooses to be a splitter candidate with probability $\sqrt{M}/n$.
As we shall see, with high probability the number of candidates is between $\sqrt{M}/2$ and $3\sqrt{M}/2$.
This step requires only constant time.

The candidates are then sorted in O(log M) time using a simple deterministic algorithm
based on counting [70, 84].

After the candidates are sorted, every (lg N)th one in the sorted order is chosen to be a
splitter. This oversampling technique, due to Reif, ensures that each of the intervals contains
approximately the same number of packets, with high probability. Note that we oversample
by a factor of lg N, where N is the number of inputs in the entire network, independent of
the number of inputs, M, of the butterfly on which the algorithm is invoked. Since with
high probability there are at least $\sqrt{M}/(2 \lg N)$ splitters, the subbutterflies at the next level of
recursion should have at most $2\sqrt{M} \lg N$ inputs.
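
Candidate selection and oversampling can be mimicked sequentially. The Python sketch below is an illustration under stated assumptions, not the butterfly implementation: the name `choose_splitters` and its parameters are hypothetical, and an ordinary sequential sort stands in for the deterministic counting-based sort used on the network.

```python
import math
import random


def choose_splitters(keys, M, N, rng=random):
    """Sequential sketch of Steps 2-4 for one M-input butterfly.

    keys: the distinct packet keys; N: number of inputs of the whole network.
    Returns (sorted candidates, splitters). All names are illustrative.
    """
    n = len(keys)
    if n == 0:
        return [], []
    # Step 2: each packet becomes a candidate with probability sqrt(M)/n.
    p = min(1.0, math.sqrt(M) / n)
    candidates = [k for k in keys if rng.random() < p]
    # Step 3: sort the candidates (stand-in for the counting-based sort).
    candidates.sort()
    # Step 4: keep every (lg N)th candidate in sorted order.
    step = max(1, round(math.log2(N)))
    splitters = candidates[step - 1::step]
    return candidates, splitters
```

With about $\sqrt{M}$ candidates, the sketch returns about $\sqrt{M}/\lg N$ splitters, so each interval is expected to hold roughly $n \lg N/\sqrt{M}$ packets.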

Next the splitters are distributed throughout the butterfly so that they can direct each
interval of packets to the appropriate subbutterfly. We distribute a copy of the median splitter
to each node in level 0 of the butterfly. Then we divide the splitters into upper and lower halves.
We distribute a copy of the median splitter from the upper half to each node in the upper half
of level 1. Similarly, we distribute a copy of the median splitter from the lower half to each
node in the lower half of level 1. The process continues in this fashion until all of the splitters
are used up. At this point, every node in the first $\Theta(\log(\sqrt{M}/\log N))$ levels of the butterfly
has a copy of a splitter. This step can be performed deterministically in O(log M) time.

After  the  splitters  are  positioned,  each  packet  is  routed  to  a  random  row  of  the  butterfly. 
The  packets  are  scheduled  using  the  algorithm  for  routing  on  layered  networks. 

Each interval of packets is then routed to a different subbutterfly. This step is called
splitter-directed routing [84]. The paths of the packets are determined as follows. At level 0, each packet
compares itself to the median splitter. If it is larger, it moves to the upper half of the second
level, otherwise it moves to the lower half. The process is repeated at level 1, with each
packet being directed to the appropriate quarter of level 2, and so on. The packets are scheduled
using the algorithm for routing on layered networks. When all the packets have been routed
along in the butterfly as deeply as the splitters are assigned, each subbutterfly at that level
picks new splitters and proceeds recursively.
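
The path chosen by a single packet in splitter-directed routing amounts to a binary search over the sorted splitters. The following Python sketch is a sequential illustration only; the function name `route` and the "upper"/"lower" labels are hypothetical, and the scheduling of packets through the switches is not modeled.

```python
def route(key, splitters):
    """Trace one packet through splitter-directed routing.

    splitters: sorted list of splitter keys.
    Returns (decisions, interval), where decisions[i] is the half chosen
    at level i and interval is the number of splitters smaller than key.
    """
    lo, hi = 0, len(splitters)      # splitters[lo:hi] are still relevant
    decisions = []
    while lo < hi:
        mid = (lo + hi) // 2        # median of the relevant splitters
        if key > splitters[mid]:
            decisions.append("upper")
            lo = mid + 1
        else:
            decisions.append("lower")
            hi = mid
    return decisions, lo
```

For example, with splitters 10, 20, 30 a packet with key 25 first goes to the upper half (25 > 20) and then to the lower half (25 < 30), ending in the interval between 20 and 30.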

The last step before the recursive call is to position the packets in each subbutterfly in
distinct inputs. The problem of distributing a set of packets to distinct destinations is known
as the token distribution problem [74]. On an M-input butterfly where at most c packets enter
each input, M packets can be distributed deterministically in O(c + log M) time.

The recursion continues until either the number of inputs, M, is smaller than $2^{\sqrt{\lg N}}$, or
the number of packets, n, is smaller than $\sqrt{M}$. In the first case, the sort is completed using
Batcher's odd-even merge sort. An M-input butterfly can sort M packets in $O(\log^2 M)$ time
using odd-even merge sort. For $M = 2^{\sqrt{\lg N}}$ the time is O(log N). In the second case, the
packets can be sorted deterministically in O(log M) time by the same technique that is used in
Step 3 to sort the candidates.

We can now make a rough estimate of the running time of this algorithm. Steps 1 and 2 are
performed deterministically in O(log M) time. Assuming that there are $O(\sqrt{M})$ candidates,
Steps 3, 4, and 5 also require O(log M) time. As we shall see, the expected time for Steps 6, 7,
and 8 is O(log M). Although these steps sometimes take longer than expected, let us assume
for now that they do not. In this case, the running time is given by the recurrence

\[
T(M) \le \begin{cases}
T(2\sqrt{M}\,\lg N) + O(\log M) & M > 2^{\sqrt{\lg N}} \\
O(\log N) & M \le 2^{\sqrt{\lg N}}
\end{cases}
\]

which has solution T(N) = O(log N).

1.9.2 Analysis

The  analysis  of  the  algorithm  is  broken  into  three  parts,  each  corresponding  to  a  different  use 
of  randomization  in  the  algorithm.  We  first  examine  the  use  of  randomization  in  selecting  the 
splitters.  We  show  that,  with  high  probability,  the  number  of  splitters  chosen  by  each  butterfly 
is within a constant factor of the expectation and the number of packets in each interval is
smaller than the number of inputs to the butterfly to which it is assigned. Next, we bound the
probability that the congestion is large at any particular switch in Steps 6 and 7. Finally, we
show that if the packets are scheduled using the randomized algorithm for layered networks,
then  it  is  unlikely  that  a  delay  of  more  than  0(log  N)  will  accumulate  over  the  course  of  the 
algorithm. 

1.9.3  Bounding  the  load 

The first step in the analysis is to show that, with high probability, the number of splitter
candidates chosen by each butterfly is within a constant factor of the expectation. We say that
an M-input butterfly is well-partitioned if the number of splitter candidates chosen is between
$\sqrt{M}/2$ and $3\sqrt{M}/2$. The $3\sqrt{M}/2$ upper bound ensures that the candidates can be sorted
deterministically by the butterfly in O(log M) time and the $\sqrt{M}/2$ lower bound implies that
the subbutterflies at the next level of recursion will have at most $2\sqrt{M} \lg N$ inputs. If all of
the butterflies are well-partitioned, then the algorithm terminates after O(log log N) levels of
recursion. (The choice of 1/2 and 3/2 as the coefficients of $\sqrt{M}$ is not particularly important.
Other constants would serve equally well.)

Lemma 22 For any fixed constant $k_1$ there is a constant $k_2$ such that the probability that any
butterfly with at least $k_2 \lg^2 N$ inputs is not well-partitioned is at most $1/N^{k_1}$.

Proof: We begin by considering a single M-input butterfly that is to sort n packets. Since
each packet chooses independently to be a candidate, the number of candidates has a binomial
distribution. Let S be the number of successes in r independent Bernoulli trials where each
trial has probability p of success. Then we have $\Pr[S = s] = \binom{r}{s} p^s (1 - p)^{r-s}$. We estimate the
area under the tails of this binomial distribution using a Chernoff-type bound [18]. Following
Angluin and Valiant [4] we have

\[
\Pr[S \le \gamma_1 rp] \le e^{-(1-\gamma_1)^2 rp/2}
\]

\[
\Pr[S \ge \gamma_2 rp] \le e^{-(\gamma_2-1)^2 rp/3}
\]

In our application $r = n$, $p = \sqrt{M}/n$, $\gamma_1 = 1/2$, and $\gamma_2 = 3/2$. For any fixed constant $k_3$, there
is a constant $k_2$ such that the right-hand sides of the two inequalities sum to at most $1/N^{k_3}$
for $M \ge k_2 \lg^2 N$.

To bound the probability that any butterfly is not well-partitioned, we sum the probabilities
for all of the individual butterflies. Over the course of the algorithm, the algorithm is invoked
on at most N lg N individual butterflies. Thus, the sum is at most $\lg N/N^{k_3-1}$. For any $k_1$,
there is a $k_3$ such that this sum is at most $1/N^{k_1}$. □
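
The Angluin-Valiant tail bounds can be compared against numerically evaluated binomial tails. The Python sketch below is an illustration only: the helper names are hypothetical, the values M = 1024 and n = 4096 are chosen arbitrarily for the example, and the probability mass function is evaluated in the log domain to avoid overflow.

```python
import math


def binom_pmf(r, p, s):
    """Pr[S = s] for S ~ Binomial(r, p) with 0 < p < 1 (log-domain evaluation)."""
    log_pmf = (math.lgamma(r + 1) - math.lgamma(s + 1) - math.lgamma(r - s + 1)
               + s * math.log(p) + (r - s) * math.log1p(-p))
    return math.exp(log_pmf)


def lower_tail(r, p, x):
    """Pr[S <= x], summed term by term."""
    return sum(binom_pmf(r, p, s) for s in range(0, math.floor(x) + 1))


def upper_tail(r, p, x):
    """Pr[S >= x], summed term by term."""
    return sum(binom_pmf(r, p, s) for s in range(math.ceil(x), r + 1))


def chernoff_lower(r, p, gamma1):
    """Bound on Pr[S <= gamma1 * r * p] for 0 < gamma1 < 1."""
    return math.exp(-((1 - gamma1) ** 2) * r * p / 2)


def chernoff_upper(r, p, gamma2):
    """Bound on Pr[S >= gamma2 * r * p] for 1 < gamma2 < 2."""
    return math.exp(-((gamma2 - 1) ** 2) * r * p / 3)


# Illustrative values (not from the text): M = 1024 inputs, n = 4096 packets.
M, n = 1024, 4096
r, p = n, math.sqrt(M) / n          # r * p = sqrt(M) = 32 expected candidates
too_few = lower_tail(r, p, 0.5 * r * p)
too_many = upper_tail(r, p, 1.5 * r * p)
```

Both bounds decay exponentially in $rp = \sqrt{M}$, which is why the lemma needs $M \ge k_2 \lg^2 N$: only then is $\sqrt{M}$ a large enough multiple of lg N to push the failure probability below an inverse polynomial in N.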

The next lemma shows that, with high probability, the number of packets in each interval
is at most a constant factor times its expectation. We say that an M-input butterfly that is
assigned n packets to sort is α-split if every interval has size at most $\alpha n \lg N/\sqrt{M}$. As we shall
see, if every butterfly is O(1)-split and there are O(log log N) levels of recursion, then by lightly
loading the butterfly we can ensure that no butterfly is assigned too many packets to sort.

Lemma 23 For any fixed constant $k_1$ there is a constant $k_2$ such that the probability that every
butterfly is $k_2$-split is at least $1 - 1/N^{k_1}$.

Proof: We begin by examining a single packet in a single M-input butterfly that is to sort
n packets. To show that a packet lies in an interval of size at most $k_2 n \lg N/\sqrt{M}$ it is sufficient
to show that both following and preceding it in the sorted order at least lg N of the next
$k_2 n \lg N/(2\sqrt{M})$ packets are candidates.

First we consider the packets that follow in the sorted order. The number of candidates
in a sequence of $k_2 n \lg N/(2\sqrt{M})$ packets has a binomial distribution. For $r = k_2 n \lg N/(2\sqrt{M})$,
$p = \sqrt{M}/n$, $rp = k_2 \lg N/2$, and $\gamma_1 = 2/k_2$, we have $\Pr[S < \lg N] \le e^{-(1 - 2/k_2)^2 k_2 \lg N/4}$. For any
$k_3$ we can make the right-hand side smaller than $1/N^{k_3}$ by choosing $k_2$ large enough.

The calculations for the packets that precede in the sorted order are identical. The
probability that fewer than lg N of the preceding $k_2 n \lg N/(2\sqrt{M})$ packets are candidates is at most
$1/N^{k_3}$. Thus, the probability that an individual packet lies in an interval of size greater than
$k_2 n \lg N/\sqrt{M}$ is at most $2/N^{k_3}$.

To bound the probability that any interval in the butterfly is too large we sum the
probabilities that each individual packet lies in an interval that is too large. Since there are at most
N lg N packets, this sum is at most $2 \lg N/N^{k_3-1}$.

To bound the probability that any butterfly is not $k_2$-split, we sum the probabilities that
each individual butterfly is not. Over the course of the algorithm, the algorithm is invoked on
at most N lg N butterflies. The sum of the probabilities is at most $2 \lg^2 N/N^{k_3-2}$. For any
constant $k_1$, we can make this sum at most $1/N^{k_1}$ by making $k_3$ large enough. □

The remainder of the analysis is conditioned on the event that every butterfly is
well-partitioned and O(1)-split, which occurs with high probability. Two technical points bear
mentioning. First, Lemma 22 requires that the number of inputs to every butterfly be at least
$k_2 \lg^2 N$, where $k_2$ is some constant. Since the recursion terminates when the number of inputs
is $2^{\sqrt{\lg N}}$, N must be large enough that $2^{\sqrt{\lg N}} \ge k_2 \lg^2 N$. Second, both Lemmas 22 and 23
hold independent of the number of packets to be sorted by each butterfly. Thus, as the following
lemmas show, we can adjust the load on the butterfly in order to ensure that each M-input
butterfly receives at most M packets to sort.

Lemma 24 The number of levels of recursion is O(log log N).

Proof: At each level of recursion the number of inputs drops from M to at most $2\sqrt{M} \lg N$,
until the number of inputs reaches $2^{\sqrt{\lg N}}$. □
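
The doubly logarithmic depth can be checked by iterating the map $M \mapsto 2\sqrt{M} \lg N$ on logarithms. The Python sketch below is illustrative only: the function name `recursion_levels` is hypothetical, it treats the bound as an equality, and it works with lg M rather than M so that very large N can be handled. It also assumes N is large enough that the stopping size $2^{\sqrt{\lg N}}$ exceeds the fixed point $4 \lg^2 N$ of the map.

```python
import math


def recursion_levels(lg_n: float) -> int:
    """Count iterations of M -> 2*sqrt(M)*lg N, starting at M = N,
    until M <= 2**sqrt(lg N). Works on m = lg M to avoid huge numbers."""
    threshold = math.sqrt(lg_n)        # lg of the stopping size 2^sqrt(lg N)
    shift = 1 + math.log2(lg_n)        # lg of the factor 2 * lg N
    if threshold <= 2 * shift:
        # The map's fixed point 4 lg^2 N is above the stopping size.
        raise ValueError("N too small for the recursion to reach the base case")
    m, levels = lg_n, 0
    while m > threshold:
        m = m / 2 + shift              # lg(2 * sqrt(M) * lg N)
        levels += 1
    return levels
```

Since lg M is roughly halved at each level, about lg(lg N / sqrt(lg N)) = (1/2) lg lg N halvings, plus a few more to absorb the additive term, are enough to reach the base case.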

Lemma 25 There is an α > 0 such that if the number of packets to be sorted is $N/\lg^{\alpha} N$, then
the number of packets assigned to any M-input butterfly is at most M.

Proof: Since the ratio of packets to inputs is $1/\lg^{\alpha+1} N$ at the top level of the recursion, and
increases by at most a constant factor at each of O(log log N) levels, it is possible to choose α
such that at the bottom level it will be at most one. □

1.9.4 Bounding the congestion at each switch

The second step in the analysis is to bound the probability that too many packets pass through
any switch in Steps 6 and 7. The following lemma provides a bound on the probability that
the congestion, c, in an M-input butterfly exceeds lg M in either of these steps.

Lemma 26 There is a fixed constant $\beta_1$ such that for $s > \lg M$,

\[ \Pr[c \ge s] \le (\beta_1/s)^s. \]


Proof: For the sake of brevity, we examine Step 7 only. A similar (and simpler) analysis holds
for Step 6.

We begin by counting the number of packets that can possibly use a switch. Let L denote
the depth of an M-input butterfly, i.e., L = lg M. From a switch at level $l$, $0 \le l \le L$, $2^{L-l}$
rows can be reached. The splitters partition these rows into subbutterflies. From the previous
argument, the number of packets that enter each of these subbutterflies is at most the number
of inputs, with high probability. Thus, at most $2^{L-l}$ packets can pass through the switch.

Next we determine the probability that a packet that can pass through the switch actually
does so. A switch at level $l$ can be reached from $2^l$ different inputs. Since each packet begins
in a random input, the probability that it can reach the switch is $2^{l-L}$.

The number of packets, S, that pass through a particular switch at level $l$ has a binomial
distribution. The number of trials is $r = 2^{L-l}$ and the probability of success is $p = 2^{l-L}$.
Thus, $\Pr[S = s] = \binom{2^{L-l}}{s} (2^{l-L})^s (1 - 2^{l-L})^{2^{L-l}-s}$. Using the inequality $\binom{a}{b} \le (ae/b)^b$, we have
$\Pr[S = s] \le (e/s)^s$. For $s \ge 1$, the right-hand side decreases by at least a constant factor with
each increase of 1 in s. Thus $\Pr[S \ge s] \le O((e/s)^s)$.

We bound the congestion in the entire butterfly by summing the individual probabilities
over all $2^{O(L)}$ switches in the butterfly. We have

\[ \Pr[c \ge s] \le 2^{O(L)} \cdot O((e/s)^s). \]

For $s > L$, we have $\Pr[c \ge s] \le (\beta_1/s)^s$ for some constant $\beta_1$. □
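
The per-switch estimate $\Pr[S = s] \le (e/s)^s$ is easy to check numerically. The Python sketch below is an illustration with hypothetical helper names (`pr_exact`, `bound`); it evaluates the binomial probability with integer arithmetic for $r = 2^{L-l}$ trials and success probability $2^{l-L}$.

```python
import math


def pr_exact(L: int, l: int, s: int) -> float:
    """Pr[S = s] with r = 2**(L - l) trials and success probability 1/r."""
    r = 2 ** (L - l)
    # C(r, s) * (1/r)**s * (1 - 1/r)**(r - s), using integers until the end.
    return math.comb(r, s) * (r - 1) ** (r - s) / r ** r


def bound(s: int) -> float:
    """The estimate (e/s)**s from the proof, for s >= 1."""
    return (math.e / s) ** s
```

The bound does not depend on the level l: the expected number of packets through any switch is rp = 1, so every switch sees the same Poisson-like tail, and the union over all switches costs only the $2^{O(L)}$ factor absorbed into $\beta_1$.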

1.9.5  Bounding  the  cumulative  delay 

Since  a  subbutterfly  does  not  begin  to  execute  its  algorithm  until  the  larger  butterfly  at  the 
previous  level  of  recursion  is  finished,  delay  in  excess  of  the  time  allotted  to  each  butterfly 
accumulates over the course of the algorithm. An M-input butterfly is allotted O(log M)
time to perform its steps. However, Steps 6, 7, and 8 are not guaranteed to terminate in
time O(log M). It is tempting to try to prove that these steps terminate quickly with high
probability.  This  approach  fails  because  at  the  lower  levels  of  the  recursion  the  problem  size 
is  so  small  that  nothing  can  be  ascertained  with  high  probability.  Instead  we  must  argue  that 
although  delay  may  occur  at  any  particular  step,  it  is  unlikely  that  a  lot  of  delay  will  accumulate 
over  a  sequence  of  steps. 

The delay from Step 8 is relatively easy to analyze. This step requires O(c + L) time; the
delay depends only on the congestion. Lemma 26 bounds the probability that the congestion
is large.

There  are  two  possible  causes  of  delay  in  Steps  6  and  7.  A  poor  set  of  random  rows  for 
the  packets  can  cause  congestion  at  some  node,  which  guarantees  that  some  packet  will  arrive 
at  its  destination  late.  On  the  other  hand,  even  if  the  congestion  is  small,  a  poor  choice  for 
the  random  ranks  used  by  the  scheduling  algorithm  may  delay  a  packet.  The  following  pair  of 
lemmas  bounds  the  probability  that  the  delay  from  these  steps  is  large.  The  first  is  a  restatement 
of  the  main  scheduling  theorem  for  layered  networks.  It  bounds  the  probability  that  a  packet 
will  be  delayed  when  the  congestion  is  small.  The  second  puts  this  bound  together  with  the 
bound  that  the  congestion  is  large  from  Lemma  26. 

Lemma 27 For a bounded-degree layered network with L levels and a set of $2^{O(L)}$ packets whose
paths have congestion c, there is a fixed constant $\beta_2$ such that the probability that any packet
arrives at its destination after time w, $w \ge L$, is at most $(\beta_2 c/w)^w$.

Lemma 28 There is a constant $\beta_3 > 1$ such that the probability Steps 6 and 7 require more
than w time steps, $w \ge L$, is at most $2(1/\beta_3)^w$.

Proof: For the sake of brevity, we examine Step 7 only. A similar analysis holds for Step 6.

We break the analysis into two cases according to whether the congestion is small or large.
Let T be the time at which the last packet arrives. Then

\[ \Pr[T > w] \le \Pr[T > w \mid c \le w/\beta_2\beta_3] + \Pr[c > w/\beta_2\beta_3]. \]

We use Lemma 27 to bound the first term on the right. Plugging in $w/\beta_2\beta_3$ for c yields
$\Pr[T > w \mid c \le w/\beta_2\beta_3] \le (1/\beta_3)^w$. We use Lemma 26 to bound the second term on the right.
Plugging in $w/\beta_2\beta_3$ for c yields $\Pr[c > w/\beta_2\beta_3] \le (\beta_1\beta_2\beta_3/w)^w$. Since $w \ge L \ge \sqrt{\lg N}$, and
$\beta_1$, $\beta_2$, and $\beta_3$ are constants, $w \ge \beta_1\beta_2\beta_3^2$ for sufficiently large N. □

The following lemma bounds the combined delay of Steps 6, 7, and 8.

Lemma 29 There are constants $\beta_4$ and $\beta_5 > 1$ such that the probability that Steps 6, 7, and 8
together require time $\beta_4 L + w$ is at most $(1/\beta_5)^w$.

Proof: Step 8 can be performed deterministically in time O(c + L). From Lemma 26 we have
$\Pr[c \ge s] \le (\beta_1/s)^s$, for $s > L$. For our purposes, a weaker bound on this probability suffices.
Since $\beta_1$ is a constant, there is a constant $k_1$ such that $(1/k_1)^s \ge (\beta_1/s)^s$ for sufficiently large
L. Combining this bound with that of Lemma 27 yields the desired result. □

To complete our analysis of the algorithm, we need to bound the probability that more than
O(log N) delay accrues during the sort.

Lemma 30 For any fixed constant $k_1$, there is a constant $k_2$ such that the probability that the
cumulative delay is more than $k_2 \lg N$ is at most $1/N^{k_1}$.

Proof: The cumulative delay at the bottom level of the recursion is the sum of the delay at
each of the butterflies on the branch of the recursion tree from the top level to the leaf. Let $D_i$
be the delay beyond $\beta_4 L$ at the ith level of the recursion. Then $\Pr[D_i = w] \le (1/\beta_5)^w$. Notice
that there is no dependence on i in this expression. Let D be the cumulative delay on a branch
of the recursion from the top level to a leaf. Then $D = \sum_i D_i$. Generating functions
help us here. The generating function for $D_i$ is

\[ G_{D_i}(x) = \sum_{w} \Pr[D_i = w]\, x^w, \]

where $x^w$ can be thought of as a place holder. To sum the delay, we simply multiply
the generating functions. Thus, the generating function for the cumulative delay is
$G_D(x) = \prod_i G_{D_i}(x)$. The coefficient of $x^w$ in $G_D(x)$ is at most $\binom{w+\ell-1}{\ell-1}(1/\beta_5)^w$, where
$\ell = O(\log\log N)$ is the number of levels of recursion. For
$w = \Omega(\log\log N)$, this coefficient is at most $(O(1)/\beta_5)^w$. For any $k_3$, there is a $k_2$ such that
$\sum_{w \ge k_2 \lg N} (O(1)/\beta_5)^w$ is at most $1/N^{k_3}$.

To bound the probability that the cumulative delay exceeds $k_2 \lg N$ on any branch of the
recursion, we sum the individual probabilities for all of the branches. There are at most N
branches. Thus, the sum is at most $1/N^{k_3-1}$. For any $k_1$, there is a $k_3$ such that this sum is at
most $1/N^{k_1}$. □

1.9.6 Putting it all together

Theorem 31 With high probability, an N lg N-node butterfly can sort N lg N packets in
O(log N) steps using constant-size queues.

Proof: The algorithm for sorting N lg N packets on an N lg N-node butterfly uses the
algorithm for sorting $N/\lg^{\alpha} N$ packets as a subroutine. First each packet independently chooses
to be a splitter candidate with probability $1/\lg^{\alpha+1} N$. With high probability, this leaves $O(N/\log^{\alpha} N)$
candidates. The candidates are sorted using the subroutine. Then every (lg N)th candidate is
selected to be a splitter, leaving $\Theta(N/\log^{\alpha+1} N)$ splitters. The splitters are distributed
throughout the butterfly, and splitter-directed routing is used to route intervals of size $O(\log^{\alpha+2} N)$
to subbutterflies with $O(\log^{\alpha+1} N)$ inputs. Now each interval of $O(\log^{\alpha+2} N)$ packets resides
in a group of $O(\log^{\alpha+1} N)$ butterfly rows. Each of these rows contains O(log N) packets. The
packets in each row can be sorted in O(log N) time using an odd-even transposition sort. With
a fixed number of row sorts and permutations, all of the packets can be sorted in O(log N) time
using Columnsort. □

1.10 Counterexamples to on-line algorithms

This section presents examples where several natural on-line scheduling strategies do poorly.
Based on these examples, we suspect that finding an on-line algorithm that can schedule any
set of paths in O(c + d) steps using constant-size queues will be a challenging task.

In the first example, we describe an N-node network in which a set of packets with
congestion and dilation O(1) requires $\Omega(\log^2 N/\log\log N)$ steps to be delivered using the strategy
of Section 1.3. This example does not contradict the results of Section 1.3, since the network
has $\Theta(\log^2 N)$ levels. However, it shows that reducing the congestion and dilation below the
number of levels will not necessarily improve the running time.

Observation 32 For the strategy of Section 1.3, there is an N-node directed acyclic network
of degree 3 and a set of paths with congestion c = 3 and dilation d = 3 where the expected length
of the schedule is $\Theta(\log^2 N/\log\log N)$.

Proof: The network consists of many disjoint copies of the subnetwork pictured in Figure 1-16.
For simplicity, we dispense with the initial queues; the packets originate in edge queues. The
subnetwork is composed of k/log k linear chains of length k, where k shall later be shown to
be Θ(log N). The second node of each linear chain is connected to the second to last node
of the previous chain by a diagonal edge. We assume that at the end of each edge there is a
queue that can store 2 packets. Initially, the queue into the first node of each chain contains an
end-of-stream (EOS) signal and one packet, and the queue into the second node contains two
packets. A packet's destination is the last node in the previous chain. Each packet takes the
diagonal edge to the previous chain and then the last edge in the chain. Thus, the length of
the longest path is d = 3.

When the ranks $r_1, \ldots, r_{3k/\log k}$ of the packets $p_1, \ldots, p_{3k/\log k}$ are chosen so that $r_i < r_{i+1}$
for $1 \le i < 3k/\log k$, packet $p_{3k/\log k}$ requires $\Omega(k^2/\log k)$ steps to reach its destination. The
scenario unfolds as follows. Packets $p_1$ and $p_2$ take a diagonal edge in the first two steps. These
packets cannot advance until the EOS reaches the end of the first chain, in step k. In the
meantime, ghosts with ranks $r_1$, $r_2$, and $r_3$ travel down the second chain, but packet $p_3$ blocks
an EOS signal from traveling down the chain. Packets $p_4$ and $p_5$ are waiting for this EOS signal.
They cannot advance until step 2k. In this fashion, the delay is propagated down to packet
$p_{3k/\log k}$.

Figure 1-16: Example 1.

A simple calculation reveals that the probability that $r_i < r_{i+1}$ for $1 \le i < 3k/\log k$ is
$1/2^{\Theta(k)}$. Thus, if we have $2^{\Theta(k)}$ copies of the subnetwork, we expect the ranks of the packets
to be sorted in one of them. For the total number of nodes in the network to be N, we need
k = Θ(log N). In this case, we expect some packet to be delayed $\Omega(\log^2 N/\log\log N)$ steps in
one copy of the subnetwork. □
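
The $1/2^{\Theta(k)}$ estimate can be checked numerically under one simplifying assumption that is not stated in the text: the $n = 3k/\log k$ ranks are distinct and every ordering is equally likely, so the ranks are sorted with probability 1/n!. The Python sketch below (the function name is hypothetical) computes the base-2 logarithm of that probability.

```python
import math


def log2_sorted_probability(k: int) -> float:
    """log2 of 1/n! for n = 3k / lg k ranks.

    Assumes distinct ranks with all n! orderings equally likely;
    this is an assumption of the sketch, not a statement from the text.
    """
    n = int(3 * k / math.log2(k))
    return -math.lgamma(n + 1) / math.log(2)
```

By Stirling's formula lg(n!) is about n lg n, and with n = 3k/lg k this is about 3k, so the exponent grows linearly in k as the proof requires.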

It is somewhat unfair to say that the optimal schedule for this example has length O(c + d) =
O(1), since ghosts and EOS signals must travel a distance of Θ(log N). However, even if the
EOS signals are replaced by packets with the appropriate ranks, the dilation is only O(log N),
and thus the optimum schedule has length O(log N).

The second example is quite general. It shows that for any deterministic strategy that
chooses the order in which packets pass through a switch independent of the future paths of
the packets, there is a network and a set of paths with congestion c and dilation d for which the
schedule produced has length at least c(d - 1)/log c. This observation covers strategies such
as giving priority to the packet that has spent the most (or least) time waiting in queues, and
giving priority to the packet that arrives first at a switch. The network is a complete binary
tree of height d - 1 with an auxiliary edge from the root to an auxiliary node.

Observation 33 For any deterministic strategy that chooses the order in which packets pass through
a switch independent of the paths that the packets take after they pass through the switch, there
is a network and a set of paths with congestion c and dilation d for which the schedule produced
has length at least c(d - 1)/log c.


Proof: We construct the example for congestion c and dilation d, E(c, d), recursively. The
base case is the example E(c, log c + 1). Each of the c leaves sends a packet to the auxiliary
node, causing congestion c in the auxiliary edge. The network for E(c, d) contains c copies of
the network for E(c, d - log c). First, the auxiliary nodes for these copies are paired up and
merged so that there are c/2 auxiliary nodes each with two auxiliary edges into it. Next, the
auxiliary nodes become the leaves of a complete binary tree of height log c - 1 with its own
auxiliary node and edge. For each copy of E(c, d - log c), the deterministic scheduling strategy
chooses some packet to cross its auxiliary edge last. We extend the path of this packet so that
it traverses the auxiliary edge in E(c, d). The dilation of the new set of paths is d and the
congestion c. The length of the schedule, T(c, d), is given by the recurrence


\[
T(c,d) \ge
\begin{cases}
T(c,\, d - \log c) + \log c - 1 + c, & d > \log c + 1, \\
\log c + c, & d = \log c + 1,
\end{cases}
\]

and has solution T(c, d) ≥ c(d − 1)/log c.


□ 
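As a quick numerical check of the recurrence, the following Python sketch evaluates it with equality (the smallest value the bound permits) and compares the result with c(d − 1)/log c. It assumes that c is a power of two with c ≥ 2 and that d − 1 is a multiple of log c, so that the recursion bottoms out exactly at d = log c + 1. The function name is illustrative and does not come from the text.

```python
def recurrence_value(c, d):
    """Evaluate T(c, d) from the recurrence, taken with equality.

    Assumes c is a power of two (c >= 2) and d - 1 is a positive
    multiple of log2(c).
    """
    log_c = c.bit_length() - 1
    if c < 2 or c != 1 << log_c:
        raise ValueError("c must be a power of two with c >= 2")
    if d < log_c + 1 or (d - 1) % log_c != 0:
        raise ValueError("d - 1 must be a positive multiple of log c")
    if d == log_c + 1:
        # Base case E(c, log c + 1) of the recurrence.
        return log_c + c
    # Recursive case: one more level of the construction.
    return recurrence_value(c, d - log_c) + log_c - 1 + c


for c in (2, 4, 8, 16):
    log_c = c.bit_length() - 1
    for levels in range(1, 6):
        d = levels * log_c + 1
        t = recurrence_value(c, d)
        # Integer form of the claim t >= c (d - 1) / log c.
        assert t * log_c >= c * (d - 1)
```

Each additional level of the construction adds log c to the dilation and at least c to the schedule length, which is where the factor c/log c in the bound comes from.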


The  third  example  shows  that  the  simple  look-ahead  strategy  of  giving  priority  to  the  packet 
with  the  farthest  distance  left  to  travel  fails  as  well. 


Observation 34 For the strategy in which the packet with the farthest distance left to travel (or
the farthest total distance to travel) is given priority, there is an N-node network with diameter
O(√N) and a set of paths with congestion O(√N) and dilation O(√N) for which the schedule
produced has length Ω(N).

Proof: The network consists of k linear chains labeled 0 through k − 1. Chain i is composed
of 3k − 2 − i nodes labeled 0 through 3k − 3 − i. It meets chain i + 1 at node k − 1 − i and at
every second node thereafter up to node k + i − 1. Figure 1-17 shows the network for k = 4.
We assume that the queue at the end of each edge has unlimited size and that at each step a
node can send at most one packet. Initially, the first node of each chain holds k packets. The
destination of each of these packets is the end of the chain. Note that packets in chain i have
higher priority than those in chain i + 1 whenever they meet, since the chain i packets must
travel one step farther than those in chain i + 1.

Figure 1-17: Example 3.

The key to this example is that the packets in chain i + 1 are delayed by all of the packets
in chain i at every meeting point between chains i and i + 1. Since the packets in chain 0 are
never delayed and the packets in chain 1 are not delayed by any packets other than those in
chain 0, the packets in these two chains arrive at their one meeting point simultaneously. At
this meeting point, the packets in chain 0 have priority and delay the packets in chain 1 by k
steps. In general, the packets in chains i and i + 1 arrive at meeting point j simultaneously
because the packets in chain i have been delayed j − 1 times by chain i − 1 and the packets in
chain i + 1 have been delayed j − 1 times by chain i.

The claim implies the theorem for k = √N. The packets in chain k − 1 are delayed by k
packets at each of k − 1 meeting points, resulting in a total delay of Ω(N). □

The  fourth  example  shows  that  the  natural  strategy  of  assigning  priorities  to  the  packets  at 
random  is  not  effective  either. 

Observation 35 For the strategy of assigning each packet a random rank and giving priority to
the packet with the lowest rank, there is an N-node network with diameter O(log N/log log N)
and a set of paths with dilation d = O(log N/log log N) and congestion c = O(log N/log log N)
where the expected length of the schedule is Ω((log N/log log N)^{3/2}).

Proof: As in Example 1, the network consists of many copies of a subnetwork. Each subnetwork
is constructed so that d = c = k/log k. A subnetwork consists of a linear chain of length d,
with loops of length √d between adjacent nodes (see Figure 1-18). The packets are broken into
√d groups numbered 0 through √d − 1 of √d packets each. The packets in group i use the
linear chain for i√d steps and then use √d − i loops as their path. As in Example 3, we assume
that queues have unlimited capacity and that at each step a node can send a single packet.

Figure 1-18: Example 4.

If the random ranks are assigned so that the packets in group i have smaller ranks than
the packets in groups with larger numbers, then the packets in group i delay the packets in
groups with larger numbers by d − i√d steps. Thus the last packet experiences an Ω(d^{3/2}) =
Ω((k/log k)^{3/2}) delay.

Once  again  the  ranks  of  the  packets  must  have  a  specific  order,  which  can  be  shown  to 
happen  with  high  probability  given  enough  copies  of  the  subnetwork.  As  in  Observation  32,  it 
is not hard to show this requires k = Θ(log N). □

1.11  Remarks 

The scheduling algorithm from Section 1.3 can be used as a subroutine in algorithms for
emulating shared-memory machines on bounded-degree networks. A shared-memory machine with
a large address space can be emulated by randomly hashing the memory locations to the nodes
of a butterfly as in [35] and [81]. The hashing ensures that the congestion of the packets
implementing each memory access step is small. The algorithm from Section 1.3 can be used to
schedule the movements of these packets.

The algorithm for sorting on the butterfly with constant-size queues can be modified to sort
kM^k packets on a k-dimensional mesh with side length M in O(kM) time using constant-size
queues.

Given a set of n packets whose paths have congestion c on a layered network with d levels,
a setting of ranks that ensures delivery in time O(c + d + log n) can be found
off-line deterministically in time 2^{O(c + d + log n)}. The proof uses the Raghavan-Spencer technique
[78, 89] to sequentially find a setting of the ranks so that no bad event corresponding to a delay
sequence occurs.

One application is in preparing simulations by volume and area-universal networks off-line so
that no random bits are needed. As before, the first step is to map the processors of the network
to be simulated, B, to the processors of the area-universal network, U, from Section 1.8 using
the recursive decomposition strategy from [56]. Network U has N processors, and B has area
O(N). To simulate each step of B, network U must route a set of n = O(N/log N) messages
with load factor λ = O(log N). The second step is to find paths for the messages. Since these
messages link the same processors at every step of B, it is sufficient to find paths once off-line.
They can be reused over and over during the simulation. Given a set of n messages with load
factor λ, it is possible to find a set of paths with congestion c = O(λ + log M) and dilation
d = O(log M) in a fat-tree with root capacity M off-line deterministically in time polynomial
in n and M. The final step is to find a set of ranks for the messages. These ranks can also
be reused at each step of the simulation. Network U has root capacity M = Θ(√N/log N).
Thus, both the paths and the ranks for the packets can be determined off-line deterministically
in time polynomial in N so that the time to simulate each step of B is O(log N).

By  making  minor  modifications  to  the  definition  of  a  delay  sequence,  it  is  possible  to  prove 
that not only does the late arrival of some packet imply that a bad event occurs, but also if
a  bad  event  occurs  then  some  packet  is  delayed.  More  precisely,  some  packet  arrives  at  step 
d+w  where  w  =  m  +  qf  if  and  only  if  there  is  a  delay  sequence  of  length  /  <  d  +  2/  —  1  with 
m  +  qf  packets. 




Chapter  2 


Distributed  random-access 
machines 


2.1  Introduction 

Underlying  any  realization  of  a  parallel  random-access  machine  (PRAM)  is  a  communication 
network that conveys information between processors and memory banks. Yet in most PRAM
models, communication issues are largely ignored. The basic assumption in these models is that
in  unit  time  each  processor  can  simultaneously  access  one  memory  location.  For  truly  large 
parallel  computers,  however,  computer  engineers  may  be  hard  pressed  to  implement  networks 
with  the  communication  bandwidth  demanded  by  this  assumption,  due  in  part  to  packaging 
constraints.  The  difficulty  of  building  such  networks  threatens  the  validity  of  the  PRAM  as  a 
predictor  of  algorithmic  performance.  This  chapter  introduces  a  more  restricted  PRAM  model, 
which we call a distributed random-access machine (DRAM), to reflect an assumption of limited
communication  bandwidth  in  the  underlying  network. 

This chapter describes joint research with Charles Leiserson [58].

¹We assume that in the communication network, each processor has its own local memory, the processors
are interconnected as a graph, and routing of messages is performed by the processors. The generalization to
the case when processors, memories, and switches are distinct entities is straightforward, but complicates the
definitions.

In a communication network, we can measure the cost of communication in terms of the
number of messages that must cross a cut of the network, as in [29] and [56]. Specifically,
a cut S of a network¹ is a subset of the nodes of the network. The capacity cap(S) is the
number of wires connecting processors in S with processors in the rest of the network S̄, i.e.,
the bandwidth of communication between S and S̄. For a set M of messages, we define the load
of M on a cut S to be the number of messages in M whose source is in S and whose destination
is in S̄ or vice versa. The load factor of M on S is


\[
\lambda(M,S) = \frac{\mathrm{load}(M,S)}{\mathrm{cap}(S)},
\]

and the load factor of M on the entire network is

\[
\lambda(M) = \max_{S} \lambda(M,S).
\]

The  load  factor  provides  a  simple  lower  bound  on  the  time  required  to  deliver  a  set  of  messages. 
For instance, if there are 10 messages to be sent across a cut of capacity 3, the time required
to  deliver  all  10  messages  is  at  least  the  load  factor  10/3. 
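The definition can be made concrete with a small brute-force computation. The Python sketch below enumerates every cut of a tiny network and returns the largest ratio of load to capacity. The function name and the representation (wires and messages as lists of node pairs, with parallel wires listed once per wire) are assumptions made for illustration. Enumerating all cuts takes time exponential in the number of nodes, so this is only a way to check small examples by hand.

```python
from fractions import Fraction
from itertools import combinations


def load_factor(nodes, wires, messages):
    """Return the maximum over all cuts S of load(M, S) / cap(S).

    nodes    -- iterable of hashable node names
    wires    -- list of (u, v) pairs, one entry per physical wire
    messages -- list of (source, destination) pairs
    """
    nodes = list(nodes)
    best = Fraction(0)
    # Proper nonempty subsets only; S and its complement give the same ratio.
    for size in range(1, len(nodes)):
        for subset in combinations(nodes, size):
            side = set(subset)
            cap = sum(1 for u, v in wires if (u in side) != (v in side))
            load = sum(1 for s, t in messages if (s in side) != (t in side))
            if cap == 0:
                if load:
                    raise ValueError("messages cross a cut with no wires")
                continue
            best = max(best, Fraction(load, cap))
    return best


# The example from the text: 10 messages across a cut of capacity 3.
two_nodes = load_factor(["a", "b"], [("a", "b")] * 3, [("a", "b")] * 10)
print(two_nodes)  # 10/3
```

On a four-node path with wires 0-1, 1-2, 2-3 and messages (0, 3), (1, 2), (0, 2), the same function reports a load factor of 3, attained by the cut that separates nodes {0, 1} from {2, 3}.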

There are two commonly occurring types of message congestion that the load factor measures
effectively. One is the “hot spot” phenomenon identified by Pfister and Norton [75]. When
many processors send messages to a single other processor, large delays can be experienced as
messages queue for access to that other processor. In this situation, the load factor on the cut
that isolates the single processor is high. The second phenomenon is message congestion due to
pin-boundedness. In this case, it is the limited bandwidth imposed by the packaging technology
that  can  cause  high  load  factors.  For  example,  the  cut  of  the  network  that  limits  communication 
performance  for  some  set  of  messages  might  correspond  to  the  pins  on  a  printed-circuit  board 
or  to  the  cables  between  two  cabinets. 

The  load-factor  lower  bound  can  be  met  to  within  a  polylogarithmic  factor  as  an  upper 
bound  on  many  networks,  including  volume  and  area-universal  networks,  such  as  fat-trees 
[29,  56],  as  well  as  the  standard  universal  routing  networks,  such  as  the  Boolean  hypercube 
[96].  The  lower  bound  is  weak  on  the  standard  universal  routing  networks  because  every  cut 
of  these  networks  is  large  relative  to  the  number  of  processors  in  the  smaller  side  of  the  cut, 
but  these  networks  may  be  more  difficult  to  construct  on  a  large  scale  because  of  packaging 
limitations. Networks for which the load factor lower bound cannot be approached to within
a  polylogarithmic  factor  as  an  upper  bound  include  linear  arrays,  meshes,  and  high-diameter 
networks  in  general. 


In the PRAM model, the issue of communication bandwidth does not arise even though
most parallel computers implement remote memory accesses by routing messages through an
underlying network. In the PRAM model, a set of memory accesses is presumed to take unit
time, reflecting the assumption that all sets of messages can be routed through the network
with comparable ease. In the DRAM model, a set of memory accesses takes time equal to the
load  factor  of  the  set  of  messages,  which  reflects  the  unequal  times  required  to  route  sets  of 
messages  with  different  load  factors. 

This chapter gives DRAM algorithms that solve many graph problems with efficient
communication. Our algorithms can be executed on any of the popular PRAM models because a
PRAM  can  be  viewed  as  a  DRAM  in  which  communication  costs  are  ignored. 

The remainder of this chapter is organized as follows. Section 2.2 contains a specification
of the DRAM model and the implementation of data structures in the model. The section
demonstrates how a DRAM models the congestion produced by techniques such as “recursive
doubling” that are frequently used in PRAM algorithms. Section 2.3 defines the notion of a
conservative algorithm as a concrete realization of a communication-efficient DRAM algorithm,
and gives a “Shortcut Lemma” that forms the basis of the conservative algorithms in this
chapter. Section 2.4 presents a conservative “recursive pairing” technique that can be used to
perform many of the same functions on lists as recursive doubling. Section 2.5 presents a
linear-space exclusive-read exclusive-write conservative “tree contraction” algorithm based on
the ideas of Miller and Reif [68]. Section 2.6 presents treefix computations, which are
generalizations of the parallel prefix computation [16, 24, 71] to trees. We show that treefix computations
can  be  performed  using  the  tree  contraction  algorithm  of  Section  2.5.  Section  2.7  gives  short, 
efficient,  parallel  algorithms  for  tree  and  graph  problems,  most  of  which  are  based  on  treefix 
computations.  Section  2.8  explores  the  use  of  concurrent  reads  and  writes  in  DRAM  algorithms. 
Section  2.9  discusses  the  relationship  between  the  DRAM  model  and  more  traditional  PRAM 
models,  as  well  as  the  ramifications  of  using  the  DRAM  model  in  practical  situations. 

2.2  The  DRAM  model 

This  section  introduces  the  abstraction  of  a  distributed  random-access  machine  (DRAM).  We 
show how a parallel data structure can be embedded in a DRAM, and we define the load
factor of a data structure. We show how the embedding of a data structure in a network
can cause congestion in the underlying network when the pointers of the data structure are
accessed  in  parallel,  and  we  also  demonstrate  that  a  parallel  algorithm  can  produce  substantial 
congestion  in  an  underlying  network,  even  when  there  is  little  congestion  implicit  in  the  input 
data  structure.  We  illustrate  how  a  DRAM  accurately  models  these  two  phenomena. 

A  DRAM  consists  of  a  set  of  n  processors.  All  memory  in  the  DRAM  is  local  to  the 
processors, with each processor holding a small number of O(lg n)-bit registers. A processor
can  read,  write,  and  perform  arithmetic  and  logical  functions  on  values  stored  in  its  local 
memory.  It  can  also  read  and  write  memory  in  other  processors.  (A  processor  can  transfer 
information  between  two  remote  memory  locations  through  the  use  of  local  temporaries.)  Each 
set  of  memory  accesses  is  performed  in  a  memory  access  step,  and  any  of  the  standard  PRAM 
assumptions  about  simultaneous  reads  or  writes  can  be  made.  Our  algorithms  use  only  mutually 
exclusive  memory  references,  however,  so  these  special  cases  never  arise. 

The essential difference between a DRAM and a PRAM is that the DRAM models
communication costs. We presume that remote memory accesses are implemented by routing messages
through an underlying network. We model the communication limitations imposed by the
network by assigning a numerical capacity cap(S) to each cut (subset of processors) S of the
DRAM equal to the number of wires connecting processors in S with processors in the rest
of the network. Thus, there are many different DRAMs corresponding to the many possible
assignments of capacities to cuts. For a set M of memory accesses, we define load(M,S) to be
the number of accesses in M from a processor in S to a processor in S̄ (the rest of the DRAM),
or vice versa. The load factor of M on S is λ(M,S) = load(M,S)/cap(S), and the load factor
of M on the DRAM is λ(M) = max_S λ(M,S).

The basic assumption in the DRAM model is that the time required to perform a set M of
memory accesses is the load factor λ(M). (Local operations take unit time.) This assumption
constitutes the principal difference between the DRAM and the network it models. We know
that the load factor is a lower bound on the time required in both the network and the DRAM.
If the network’s message routing algorithm cannot approach this lower bound as an upper
bound (for example, if the network has high diameter), then the network is not well modeled
by the DRAM. If the network’s routing algorithm can nearly achieve the load factor as an upper
bound, then the analysis of an algorithm in the DRAM model will reliably predict the actual
performance of the algorithm on the network. Section 2.9 discusses some networks for which
the DRAM is a reasonable model, including volume-universal networks such as fat-trees [56].

A natural way to embed a data structure in a DRAM is to put one record of the data
structure into each processor, as in the “data parallel” model [33]. The record can contain data,
including pointers to records in other processors. We measure the quality of an embedding by
treating the data structure as a set of pointers and generalizing the concept of load factor to
sets of pointers. The load of a set P of pointers across a cut S, denoted load(P,S), is the
number of pointers in P from a processor in S to a processor in S̄, or vice versa. The load
factor of P on the entire DRAM is

\[
\lambda(P) = \max_{S} \frac{\mathrm{load}(P,S)}{\mathrm{cap}(S)}.
\]

The  load  factor  of  a  data  structure  is  the  load  factor  of  the  set  of  its  pointers.  For  many 
problems,  good  embeddings  of  data  structures  can  be  found  in  particular  networks  for  which 
the DRAM is a good abstraction (see Section 2.9).

There are generally two situations in which message congestion can arise during the
execution of an algorithm on a network, both of which are modeled accurately by a DRAM whose cut
capacities correspond to the cut capacities of the network. In the first situation, the embedding
of a data structure causes congestion because many of its pointers cross a relatively small cut
of the network. A parallel access of the information across those pointers generates substantial
message traffic across the cut. In the second situation, the data structure is embedded with few
pointers  crossing  the  cut,  but  the  algorithm  itself  generates  substantial  message  traffic  across 
the  cut.  We  now  illustrate  these  two  situations. 


As  an  example  of  the  first  situation,  consider  an  embedding  of  a  simple  linear  list  in  which 
alternate list elements are placed on opposite sides of a narrow cut of a network. If each element
fetches  a  value  from  the  next  element  in  the  list,  the  load  factor  across  the  cut  is  large.  In  the 
DRAM  model,  this  congestion  is  modeled  by  the  increase  in  time  required  for  the  memory 
accesses  across  the  cut.  (Observe  that  in  a  PRAM  model,  the  congestion  is  not  modeled  since 
any  set  of  memory  accesses  is  assumed  to  take  unit  time.)  Of  course,  a  list  can  typically  be 
embedded  in  a  network  so  that  the  number  of  list  pointers  crossing  any  cut  is  small  compared 
to  the  capacity  of  the  cut,  again  a  situation  that  can  be  modeled  by  a  DRAM. 


In the second situation, the congestion is produced by an algorithm. As an example,
consider the “recursive doubling” or “pointer jumping” technique [101] used extensively by PRAM
algorithms in the literature. The idea is that each element i of a list initially has a pointer
p(i) to the next element in the list. At each step, element i computes p(i) ← p(p(i)), doubling
the distance between i and the element it points to, until it points to the end of the list. This
technique can be used, among other things, to compute the distance d(i) of each element i to
the end of the list. Initially, each element i sets d(i) ← 1. At each pointer-jumping step, each
element i not pointing to the end of the list computes d(i) ← d(i) + d(p(i)). In a PRAM model,
the running time on a list of length n is O(lg n). Variants of this technique are used for path
compression, vertex numbering, and parallel prefix computations [68, 88, 92, 101].
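The pointer-jumping computation of d(i) can be sketched sequentially in Python as follows. This is a minimal simulation under stated assumptions: the list is stored in arrays indexed from 0, the end of the list is represented by None, and each parallel step is simulated by computing all new values from the old ones before overwriting them. The function name is illustrative.

```python
def pointer_jump_distances(n):
    """Simulate recursive doubling on an n-element linked list.

    Returns (dist, steps), where dist[i] is the distance from element i
    to the end of the list and steps is the number of parallel
    pointer-jumping steps that were simulated.
    """
    # p(i): index of the next element, or None for "end of the list".
    nxt = [i + 1 if i + 1 < n else None for i in range(n)]
    dist = [1] * n  # d(i) <- 1 initially
    steps = 0
    while any(p is not None for p in nxt):
        new_nxt, new_dist = nxt[:], dist[:]
        for i in range(n):
            p = nxt[i]
            if p is not None:
                new_dist[i] = dist[i] + dist[p]  # d(i) <- d(i) + d(p(i))
                new_nxt[i] = nxt[p]              # p(i) <- p(p(i))
        nxt, dist = new_nxt, new_dist
        steps += 1
    return dist, steps


dist, steps = pointer_jump_distances(16)
print(dist[0], steps)  # 16 4
```

For a list of 16 elements the loop runs four times, matching the O(lg n) step count stated for the PRAM model.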

We now show that recursive doubling can be expensive even when a data structure has a
good  embedding  in  a  network.  Figure  2-1  shows  a  cut  of  capacity  3  separating  the  two  halves 
of  a  linked  list  of  16  elements.  In  the  first  step  of  recursive  doubling,  the  load  on  the  cut  is  only 
1  because  the  only  access  across  the  cut  occurs  when  element  8  accesses  the  data  in  element  9. 
In the second step, the load is 2 because element 7 accesses element 9 and element 8 accesses
element  10.  In  the  third  step,  the  load  is  4,  and  in  the  fourth  step,  each  of  the  first  eight 
elements  makes  an  access  across  the  cut,  creating  a  load  of  8.  Since  the  load  factor  of  the  cut  in 
the  fourth  step  is  8/3,  this  set  of  accesses  requires  at  least  3  time  units.  Whereas  the  capacity  of 
the  cut  is  large  enough  to  support  the  memory  accesses  across  it  in  the  first  step,  by  the  fourth 
step,  the  cut  capacity  is  insufficient.  In  a  DRAM,  this  situation  is  modeled  by  the  increased 
time  to  perform  the  memory  accesses  in  the  fourth  step  compared  with  those  in  the  first  step. 
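The growth of the load in this example can be reproduced with a short Python sketch. It assumes the list is laid out in order with the cut between the two halves, numbers elements from 0 rather than from 1 as in the text, and counts one access per element that still has a pointer. The function name and parameters are illustrative.

```python
import math


def doubling_cut_loads(n, left_size):
    """Load on the cut at each recursive-doubling step.

    Elements 0 .. left_size-1 lie on the left of the cut and the rest on
    the right. Entry t of the result counts the accesses that cross the
    cut in step t + 1.
    """
    nxt = [i + 1 if i + 1 < n else None for i in range(n)]
    loads = []
    while any(p is not None for p in nxt):
        # Element i reads from nxt[i]; the access crosses the cut when
        # the two elements lie on opposite sides.
        loads.append(sum(
            1 for i, p in enumerate(nxt)
            if p is not None and (i < left_size) != (p < left_size)
        ))
        # All elements jump simultaneously: p(i) <- p(p(i)).
        nxt = [nxt[p] if p is not None else None for p in nxt]
    return loads


loads = doubling_cut_loads(16, 8)
print(loads)  # [1, 2, 4, 8]

capacity = 3
print(math.ceil(loads[-1] / capacity))  # 3 time units for the last step
```

The load doubles at every step even though the list itself places only one pointer across the cut.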

The focus of this chapter is avoiding this second cause of congestion. In Section 2.4, we
shall  show  how  a  recursive  pairing  strategy  can  perform  many  of  the  same  functions  as  recursive 
doubling,  but  in  a  communication-efficient  fashion. 


2.3  Conservative  algorithms 

This section introduces the notion of a conservative algorithm. In the DRAM model, a
conservative algorithm is communication efficient in the sense that it never produces more congestion
across cuts of the DRAM than is implicit in the input data structure. We give an important
lemma that shows how pointers in a data structure can be “shortcut” without introducing
congestion.

Figure 2-1: A cut of capacity 3 separating two halves of a linked list. The load of the list on
the cut is 1. At the final step of recursive doubling, each element on the left side of the cut
accesses an element on the right, which induces a load of 8 on the cut.

A conservative algorithm is a DRAM algorithm in which the load factor of memory accesses
in any step is bounded by the load factor of the input data structure, independent of the cut
capacities of the DRAM on which the algorithm is executed. To be precise, we define a set M
of memory accesses to be conservative with respect to another set M′ of memory accesses if for
all cuts S of a DRAM, we have load(M,S) ≤ load(M′,S). By implication, whatever the cut
capacities of the DRAM, we have λ(M) ≤ λ(M′). We make the natural extension of the term
conservative to sets of pointers and data structures. A conservative algorithm is thus one all
of whose memory accesses are conservative with respect to the input data structure. Thus, if
a conservative algorithm runs for T steps on an input data structure with load factor λ, then
the total time for the algorithm is at most λT.

If at every step, the memory accesses of an algorithm correspond to a subset of pointers
in the input data structure, then the algorithm is certainly conservative since if M is a subset
of M′, then we have load(M,S) ≤ load(M′,S) for every cut S. For example, synchronous distributed algorithms,
such as the network flow algorithms of Goldberg and Tarjan [26, 27], are conservative for this
reason. We do not wish to restrict our attention to this limited class of conservative algorithms
because synchronous distributed algorithms cannot efficiently solve certain problems on graphs
with high diameter. For example, the problem considered earlier of determining the distance of
each element to the end of the list cannot be solved in less than linear time with a synchronous
distributed algorithm. A PRAM algorithm, however, can perform such a computation in
logarithmic time, for example, by recursive doubling, but recursive doubling is not conservative.

Figure 2-2: The Shortcut Lemma. In each of the four cases illustrated, the load factor across
the cut is either unchanged or diminished by replacing a → b and b → c with a → c.

We  would  like  to  know  conditions  under  which  processors  in  a  DRAM  can  communicate 
directly with distant locations in a data structure without increasing communication
requirements as measured by the load factor. The following simple, but important, lemma provides
conditions that are sufficient for any DRAM.

Lemma 36 (Shortcut Lemma) Suppose a set P of pointers in a data structure contains
pointers a → b and b → c. Then the set Q of pointers defined by

\[
Q = P \cup \{a \to c\} - \{a \to b,\; b \to c\}
\]

is conservative with respect to P. Moreover, any set Q of pointers is conservative with respect
to another set P of pointers if there exist pointer-disjoint paths in P that connect the endpoints
of pointers in Q.

Proof: We show only the first part of the lemma since the second part follows immediately by
induction. We shall show that load(Q,S) ≤ load(P,S) for any cut S of the DRAM. Consider
the eight ways in which a, b, and c can be assigned to sides of the partition induced by a cut S.
Half the cases can be eliminated by symmetry if we assume that a is on the left side. In each of
the four remaining cases, the load across the cut is either unchanged or diminished when a → b
and b → c are replaced with a → c, as is shown in Figure 2-2. □
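The case analysis in the proof is small enough to check mechanically. The Python sketch below enumerates all eight assignments of a, b, and c to the two sides of a cut and confirms that the single pointer a → c never crosses the cut more often than the pair a → b, b → c does. Pointers of P other than these two are unchanged by the shortcut, so they contribute equally to both loads and are omitted. The function names are illustrative.

```python
from itertools import product


def load(pointers, side):
    """Number of pointers whose endpoints lie on opposite sides of the cut."""
    return sum(1 for u, v in pointers if side[u] != side[v])


def shortcut_is_conservative():
    """Check load(Q, S) <= load(P, S) for every placement of a, b, c."""
    before = [("a", "b"), ("b", "c")]  # the two pointers removed from P
    after = [("a", "c")]               # the shortcut pointer added to Q
    for bits in product((0, 1), repeat=3):
        side = dict(zip("abc", bits))  # 0 = left of the cut, 1 = right
        if load(after, side) > load(before, side):
            return False
    return True


print(shortcut_is_conservative())  # True
```

The only placements in which the load strictly decreases are those with a and c on one side and b on the other, where two crossings are replaced by none.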
In summary, this section has introduced the notion of a conservative algorithm. An upper
bound on the time required by a conservative algorithm can be determined solely from the
embedding of an input data structure on the DRAM. If the number of steps of the conservative
algorithm is T and the load factor of the input data structure is λ, then the total time is at
most λT. A user of a conservative algorithm therefore need only minimize the congestion of
pointers  in  the  input  data  structure  across  cuts  of  the  DRAM  to  minimize  the  time  required  by 
the  algorithm.  If  the  embedding  of  the  data  structure  is  good,  that  is,  its  load  factor  is  small, 
then  a  conservative  algorithm  that  uses  a  small  number  of  steps  runs  fast. 

2.4  List  contraction 

In this section we present a conservative “recursive pairing” algorithm, Algorithm LC, that can
perform many of the same functions on lists as recursive doubling. The idea is to contract an
input list by repeatedly pairing and merging adjacent elements of the list until only a single
element remains. The merges are recorded as internal nodes of a binary contraction tree whose
leaves are the elements in the input list. After building the contraction tree, operations such
as broadcasting from the root or parallel prefix can be performed in a conservative fashion.
Algorithm LC is a randomized algorithm, and with high probability, the height of the
contraction tree and the number of steps on a DRAM are both O(lg n), where n is the number
of elements in the input list. A deterministic variant based on deterministic coin tossing [20]
runs in O(lg n lg* m) steps, where m is the number of processors in the DRAM, and produces
a contraction tree of height O(lg n).

The recursive pairing strategy is illustrated in Figure 2-3 for a list (A, B, C, D, E). In the
first step, elements B and C pair and merge, as do elements D and E. The merges are shown as
contours in the figure. A new contracted list (A, BC, DE) is formed from the unpaired element
A and the two compound elements BC and DE. After the second step of the algorithm, the
contracted list consists of the elements ABC and DE. The third and final step reduces the list
to the single element ABCDE.

In Algorithm LC, the contours of Figure 2-3 are represented in a data structure called
a contraction tree. The leaves of the contraction tree are the list elements, and the internal
nodes are the contours. To maintain the contraction-tree data structure, the algorithm requires
constant extra space for each element in the input list. Each processor contains two elements:
an element in the input list, and a spare element that will act as an internal node in the
contraction tree. We call the two elements in the same processor mates. Each element holds
a pointer to an unused internal node, which for each list element initially points to its mate.
The use of spare nodes allows the algorithm to distribute the space for the internal nodes of
the contraction tree uniformly over the elements in the list. (Spare internal nodes are used in
(M) and [55] for similar reasons, but in a different context.)

We  now  describe  the  operation  of  Algorithm  LC,  which  is  illustrated  in  Figure  2-4  for  the 
example of Figure 2-3. (A description in pseudocode can be found in [57].) In the first step, each
element  of  the  input  list  randomly  picks  either  its  left  or  right  neighbor.  An  element  at  the  left 
or  right  end  of  the  list  always  picks  its  only  neighbor.  If  two  elements  pick  each  other,  then  they 
merge.  The  merge  is  recorded  by  making  the  spare  of  the  left  element  of  the  pair  be  the  root  of 
a  new  contraction  tree.  The  spare  of  the  right  element  becomes  the  spare  for  the  root,  and  the 
elements  themselves  become  the  children  of  the  root.  The  roots  of  the  new  contraction  trees 
and  the  unpaired  list  elements  now  form  themselves  into  a  new  list  representing  the  contracted 
list,  upon  which  the  algorithm  operates  recursively. 
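
The pairing and merging rounds described above can be sketched sequentially as follows. This is only an illustration under stated assumptions: it runs on one processor rather than a DRAM, it records each merge as a Python tuple `(left, right)` instead of modeling spares and mates, and the names `contract_list`, `leaves`, and `height` are hypothetical. The input is assumed to be nonempty and its items must not themselves be tuples.

```python
import random


def contract_list(items, rng=None):
    """Simulate recursive pairing; return (contraction_tree, rounds).

    Leaves are the original items; each internal node is a tuple (left, right).
    """
    rng = rng or random.Random()
    current = list(items)              # the contracted list so far
    rounds = 0
    while len(current) > 1:
        n = len(current)
        # Each element picks a neighbor: -1 means left, +1 means right.
        picks = []
        for i in range(n):
            if i == 0:
                picks.append(+1)       # left end: its only neighbor is on the right
            elif i == n - 1:
                picks.append(-1)       # right end: its only neighbor is on the left
            else:
                picks.append(rng.choice((-1, +1)))
        # Merge every adjacent pair of elements that picked each other.
        merged = []
        i = 0
        while i < n:
            if i + 1 < n and picks[i] == +1 and picks[i + 1] == -1:
                merged.append((current[i], current[i + 1]))
                i += 2
            else:
                merged.append(current[i])
                i += 1
        current = merged
        rounds += 1
    return current[0], rounds


def leaves(tree):
    """Return the leaves of a contraction tree from left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def height(tree):
    """Return the height of a contraction tree (a single leaf has height 0)."""
    if isinstance(tree, tuple):
        return 1 + max(height(tree[0]), height(tree[1]))
    return 0
```

Because the left end always picks right and the right end always picks left, some adjacent pair picks each other in every round, so the loop in this sketch always makes progress.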

At  each  step  of  the  algorithm,  any  given  element  of  the  contracted  list  is  a  set  of  consecutive 
elements  in  the  input  list — a  contour  in  Figure  2-3.  The  set  is  represented  by  a  contraction- 
tree  data  structure  whose  leaves  are  the  elements  of  the  set  and  whose  internal  nodes  record 
the  merges.  When  the  entire  input  list  has  been  contracted  to  a  single  node,  the  algorithm 
terminates  and  a  single  contraction  tree  records  all  of  the  merges. 

To describe the efficiency of randomized algorithms such as Algorithm LC, we shall sometimes
say that an algorithm runs in O(T(n)) steps “with high probability,” by which we shall
mean that for any constant $k > 0$, there are constants $c_1 > 0$ and $c_2 > 0$ such that with
probability $1 - c_1/n^k$, the algorithm terminates in at most $c_2 T(n)$ steps.

Theorem 37 With high probability, Algorithm LC takes O(lg n) steps to construct a contraction
tree for a list of n elements.

Figure 2-3: The recursive pairing strategy operating on a list (A, B, C, D, E). Merged nodes
are shown as contours, and the nesting of contours gives the structure of the contraction tree.

Proof: We show that the algorithm terminates after $(k + 1) \log_{4/3} n$ iterations with probability
at least $1 - 1/n^k$. We use an accounting scheme involving “tokens” to analyze the algorithm.

Initially,  a  unique  token  resides  between  each  pair  of  elements  in  the  input  list.  Whenever  two 
list  elements  pick  each  other,  we  destroy  the  token  between  them.  For  each  token  destroyed, 
the  length  of  the  list  decreases  by  one,  and  the  algorithm  terminates  when  no  token  remains. 
In any iteration, an existing token has probability at least 1/4 of being destroyed. Thus, after
$m$ iterations, a token has probability at most $(3/4)^m$ of remaining in existence. Let $T_i$ be the
event that token $i$ exists after $m$ iterations, and let $T$ be the event that any token remains after
$m$ iterations. Then the probability that any token remains after $m$ iterations is given by


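\[
\begin{aligned}
\Pr\{T\} &= \Pr\{T_1 \cup T_2 \cup \cdots \cup T_{n-1}\} \\
         &\le \Pr\{T_1\} + \Pr\{T_2\} + \cdots + \Pr\{T_{n-1}\} \\
         &\le (n-1)\left(\tfrac{3}{4}\right)^{m} .
\end{aligned}
\]

For $m = (k + 1) \log_{4/3} n$ iterations, we have

\[
\Pr\{T\} \le (n-1)\left(\tfrac{3}{4}\right)^{(k+1)\log_{4/3} n} = \frac{n-1}{n^{k+1}} < \frac{1}{n^{k}} .
\]
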
□ 
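
As a quick numeric illustration of the token argument, the sketch below evaluates the union bound $(n-1)(3/4)^m$ at $m = \lceil (k+1)\log_{4/3} n \rceil$ and compares it with $1/n^k$. The function names are hypothetical, and the rounding up to an integer number of iterations is an assumption made here for convenience.

```python
import math


def rounds_for_bound(n, k):
    """Number of iterations m = ceil((k + 1) * log_{4/3} n) used in the proof."""
    return math.ceil((k + 1) * math.log(n, 4 / 3))


def union_bound(n, m):
    """Upper bound (n - 1) * (3/4)^m on the probability that any token survives."""
    return (n - 1) * (3 / 4) ** m
```

For example, with n = 1024 and k = 1 the sketch uses 49 iterations, and the resulting bound is below 1/1024.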


Theorem 38 With high probability, a contraction tree constructed by Algorithm LC on a list
of n elements has height O(lg n).

Proof: The height of the contraction tree is no greater than the number of iterations of
Algorithm LC. □

Figure 2-4: The operation of Algorithm LC on the example of Figure 2-3. The input list is
(A, B, C, D, E), and the corresponding spares are in lower case. When elements B and C pair
and merge in the first step of the algorithm, the spare b becomes the root of a contraction tree
with leaves B and C to represent the compound node BC. The spare for b is c. At the end
of the first step, the list consisting of the elements A, b, and d represents the contracted list
(A, BC, DE). After two more contraction steps of Algorithm LC, the input list is contracted
to a single element ABCDE, which is represented by a contraction tree whose root is c and
whose leaves are the elements of the input list (A, B, C, D, E).

We now prove that Algorithm LC is conservative.

Theorem 39 Algorithm LC is conservative.

Proof: By convention, let the mate of an element in the input list lie in the order between that
element and its right neighbor. The key idea is that the order of the list elements and their
spares is preserved by the merging operation, and consequently, after each contraction step, the
pointers in the contracted list correspond to disjoint paths in the original list, and the pointers
between elements and their spares also correspond to disjoint paths. By the Shortcut Lemma,
these two sets of pointers are each conservative with respect to the input list, and since each
set of memory accesses in a contraction step of the algorithm is a subset of one of these two
sets, the algorithm is conservative. □

Although  a  contraction  tree  itself  is  not  conservative  with  respect  to  an  input  list,  it  can 
be  used  as  a  data  structure  in  conservative  algorithms.  For  example,  contraction  trees  can  be 
used  to  efficiently  broadcast  a  value  to  all  of  the  elements  of  a  list  and  to  accumulate  values 
stored  in  each  element  of  a  list. 

More generally, contraction trees are useful for performing prefix computations in a conservative
fashion. Let V be a domain with a binary associative operation • and an identity
e. A prefix computation [16, 24, 71] on a list with elements $x_1, x_2, \ldots, x_n$ in V puts the value
$y_i = x_1 \bullet x_2 \bullet \cdots \bullet x_i$ in element $i$ for each $i = 1, 2, \ldots, n$.

A prefix computation on a list can be performed by a conservative, two-phase algorithm on
the contraction tree. The leaves of the contraction tree from left to right are the elements in
the list from $x_1$ to $x_n$. The first phase proceeds bottom up on the tree. Each leaf passes its x
value to its parent. When an internal node receives a value $z_l$ from its left child and a value $z_r$
from its right child, the node saves the value $z_l$ and passes $z_l \bullet z_r$ to its parent. When the root
receives values from its children, the second, top-down phase begins. The root passes e to its
left child and its $z_l$ value to its right child. When an internal node receives a value $z_p$ from its
parent, it passes $z_p$ to its left child, and passes $z_p \bullet z_l$ to its right child. When a leaf receives $z_p$
it computes $y = z_p \bullet x$.
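
The two phases can be sketched sequentially as below. The representation is an assumption made for illustration: an internal node of the contraction tree is a Python tuple `(left, right)`, a leaf is its x value (which must not itself be a tuple), and `prefix_on_tree` is a hypothetical name. So that the order of operands matches the leaf rule $y = z_p \bullet x$ even for non-commutative operations, the value sent to a right child is formed as the parent's value followed by the saved left value.

```python
def prefix_on_tree(tree, op, identity):
    """Return the prefix values y_1, ..., y_n for the leaves, left to right."""
    out = []

    def up(node):
        # Phase 1 (bottom-up): return (value passed to parent, annotated node).
        if not isinstance(node, tuple):
            return node, ("leaf", node)
        z_l, left = up(node[0])
        z_r, right = up(node[1])
        # Save z_l at this node and pass z_l . z_r upward.
        return op(z_l, z_r), ("node", z_l, left, right)

    def down(ann, z_p):
        # Phase 2 (top-down): z_p is the value received from the parent.
        if ann[0] == "leaf":
            out.append(op(z_p, ann[1]))        # y = z_p . x
            return
        _, z_l, left, right = ann
        down(left, z_p)                        # left child receives z_p
        down(right, op(z_p, z_l))              # right child receives z_p . z_l

    _, root = up(tree)
    down(root, identity)                       # the root behaves as if it received e
    return out
```

String concatenation is a convenient non-commutative test operation: on the tree `(("a", ("b", "c")), ("d", "e"))` the sketch yields the prefixes of "abcde" in order.
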

The number of steps required by the prefix computation is proportional to the height of the
tree, which with high probability is O(lg n). At each step, the algorithm communicates across
a set of pointers in the contraction tree, all of which are the same distance from the leaves in
the first phase, and the same distance from the root in the second. That this computation is
performed in a conservative fashion is a consequence of the following lemma.

Theorem 40 Let CT be a contraction tree computed by Algorithm LC on an input list L, and
suppose P is a subset of the pointers of CT. If no pointer in P is an ancestor of another pointer
in P, then P is conservative with respect to L.

Proof: An inorder traversal of CT alternately visits list elements (leaves) and their mates
(internal nodes) in the same order that the list elements and mates appear in L. Thus, if no
pointer in P is an ancestor of another pointer in P, the pointers in P correspond to disjoint
paths in L. By the Shortcut Lemma, any set of pointers that correspond to disjoint paths in
the list L is conservative with respect to L. □

Algorithm LC, which constructs a contraction tree in O(lg n) steps, is a randomized algorithm.
By using the “deterministic coin tossing” technique of Cole and Vishkin [20], the
algorithm can be performed nearly as well deterministically. Specifically, the randomized pairing
step can be performed deterministically in O(lg* m) steps on a DRAM with m processors,
where lg* m is the number of times the logarithm function must be successively applied to reduce
m to a value at most 1. The overall running time for list contraction is thus O(lg n lg* m).
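
The definition of lg* m translates directly into a short loop. The sketch below assumes base-2 logarithms and uses the hypothetical name `lg_star`.

```python
import math


def lg_star(m):
    """Number of times log2 must be applied to m to reach a value at most 1."""
    count = 0
    while m > 1:
        m = math.log2(m)
        count += 1
    return count
```

The function grows extremely slowly: it is 4 for m = 65536, which is why the deterministic bound O(lg n lg* m) is described as nearly as good as the randomized one.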

As  a  final  comment,  we  observe  that  with  minor  modifications,  Algorithm  LC  can  be  used 
to  contract  circular  lists  with  the  same  complexity  bounds  as  for  linear  lists. 

2.5  Tree  contraction 

This section presents a conservative tree contraction algorithm, Algorithm TC, based on the
tree contraction ideas of Miller and Reif [68]. The algorithm uses a recursive pairing strategy
to build a contraction tree for an input binary tree in much the same manner as Algorithm LC
does for a list. With high probability, the height of the contraction tree and the number of steps
on a DRAM are both O(lg n), where n is the number of nodes in the input tree. A deterministic
variant runs in O(lg n lg* m) steps and produces a contraction tree of height O(lg n).

The recursive pairing strategy for trees is illustrated in Figure 2-5 for a tree with nodes A,
B, C, D, E, and F. In the first step nodes A and B pair and merge, as do nodes C and D; the
merges are shown as contours in the figure. A new contracted tree is formed from the unpaired
nodes E and F, and the compound nodes AB and CD. In the next step of the algorithm, node
E pairs and merges with CD to form a node CDE. After two more steps the 6-node input tree
has been contracted to a single node. Notice that each node shown as a contour in the figure
is a connected subgraph of the input tree, and that the node has at most two children in the
contracted tree.

Algorithm TC represents the contours of Figure 2-5 in a contraction-tree data structure in
the same manner as Algorithm LC represents the contours of Figure 2-3. Space for the internal
nodes of the contraction tree is again provided by spares. Initially, the spare of each node in
the input tree is its mate, an unused node stored in the same processor.

We  now  outline  Algorithm  TC  in  more  detail.  (A  description  in  pseudocode  can  be  found 
in [57].) In the first step, nodes in the input tree are paired. The pairing strategy has each
node pick from among its neighbors according to how many children it has. A leaf picks its
parent with probability 1. A node with exactly one child picks either its child or its parent,
each  with  probability  1/2.  A  node  with  two  children  picks  either  child,  each  with  probability 
1/2.  The  root,  which  has  no  parent,  picks  its  children  with  equal  probability.  If  two  nodes  pick 
each  other,  then  they  merge.  The  merge  is  recorded  by  making  the  spare  of  the  parent  in  the 
pair  be  the  root  of  a  new  contraction  tree.  The  spare  of  the  child  in  the  pair  becomes  the  spare 
for  the  root,  and  the  parent  and  child  themselves  become  the  children  of  the  root.  The  new 
nodes  and  the  unpaired  nodes  form  themselves  into  a  new  tree  that  represents  the  contracted 
tree,  upon  which  the  algorithm  operates  recursively.  The  contracted  tree  is  binary  because  the 
pairing  strategy  ensures  that  no  node  with  two  children  pairs  with  its  parent. 
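
One pairing step of this strategy can be sketched as follows. The dictionaries `children` and `parent` and the function name `pairing_step` are assumptions made for illustration; the sketch only selects the pairs and does not build the contraction tree or model spares.

```python
import random


def pairing_step(children, parent, rng):
    """Return the (parent, child) pairs whose two members picked each other.

    children maps every node to a list of 0, 1, or 2 child nodes;
    parent maps every non-root node to its parent.
    """
    pick = {}
    for v, kids in children.items():
        p = parent.get(v)                      # None for the root
        if len(kids) == 2 or p is None:
            # A node with two children, or the root, chooses among its children.
            pick[v] = rng.choice(kids) if kids else None
        elif len(kids) == 1:
            pick[v] = rng.choice([kids[0], p])  # child or parent, each with prob. 1/2
        else:
            pick[v] = p                        # a leaf always picks its parent
    return [(v, c) for v, kids in children.items() for c in kids
            if pick[v] == c and pick[c] == v]
```

Since every node picks exactly one neighbor, no node can appear in two pairs, and a node with two children never picks its parent, so it is never the child in a pair.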

In the next section, we shall need to expand a contracted tree in order to describe treefix
computations recursively. Expansion consists of undoing the merges in the reverse of the order
in which they occurred. From the time that a parent and child merge to the time that the
node representing their merge in the contraction tree expands, the pointers of the pair are
undisturbed. Consequently, these pointers can be used to restore the pointers of the neighbors
of the pair to the state they had immediately before the pair merged. To ensure that the merges
are undone in the exact reverse order, as is assumed in the next section, it is helpful to store
in each internal node of the contraction tree the step number in which the merge took place.
In fact, the tree can be expanded by a greedy strategy without consulting the number of the
contraction step at which each merge occurred.

Figure 2-5: The recursive pairing strategy operating on a tree with nodes A, B, C, D, E, and F.
Merged nodes are shown as contours, and the nesting of contours gives the structure of the
contraction tree.

The  proof  that  with  high  probability,  Algorithm  TC  takes  O(lgn)  steps  to  contract  an  input 
binary  tree  to  a  single  node  requires  three  technical  lemmas.  The  first  lemma  shows  that  in  a 
binary  tree,  the  number  of  nodes  with  two  children  and  the  number  of  leaves  are  nearly  equal. 
The  second  lemma  provides  an  elementary  bound  on  the  expectation  of  a  discrete  random 
variable with a finite upper bound. The last lemma presents a Chernoff-type bound [18] on the
tail  of  a  binomial  distribution. 

Lemma 41 Suppose T = (V, E) is a rooted binary tree, and let $V_0$, $V_1$, and $V_2$ denote the sets
of nodes in T (excluding the root) with zero, one, or two children, respectively, and let d(r) be
the degree of the root. Then we have

\[
|V_0| = |V_2| + d(r) .
\]


□ 
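
The counting identity of Lemma 41, read here as $|V_0| = |V_2| + d(r)$ with d(r) taken to be the number of children of the root, can be checked on any concrete tree. The sketch below is only such a check; the function name and the `children` dictionary representation are assumptions.

```python
def check_lemma_41(children, root):
    """Return True if |V_0| == |V_2| + d(r) for the given rooted binary tree.

    children maps every node to a list of 0, 1, or 2 child nodes.
    """
    counts = {0: 0, 1: 0, 2: 0}
    for node, kids in children.items():
        if node != root:                       # the root is excluded from V_0, V_1, V_2
            counts[len(kids)] += 1
    return counts[0] == counts[2] + len(children[root])
```

The identity follows from counting edges: a tree with n nodes has n - 1 edges, which equals $|V_1| + 2|V_2| + d(r)$, while $n = |V_0| + |V_1| + |V_2| + 1$.
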

Lemma 42 Let $X \le b$ be a discrete random variable with expected value $\mu$. For $w < b$, we
have

\[
\Pr\{X \ge w\} \ge \frac{\mu - w}{b - w} .
\]

□

The final lemma presents a bound on the tail of a binomial distribution. Consider a set of t
independent Bernoulli trials, each occurring with probability p of success. The probability that
fewer than s successful trials occur is

\[
B(s, t, p) = \sum_{k=0}^{s-1} \binom{t}{k} p^{k} (1 - p)^{t-k} .
\]

The lemma bounds the probability B(s, t, p) that fewer than s successes occur in t trials when
s < t/2 and p < 1/2.

Lemma 43 For s < t/2 and p < 1/2, we have

□ 

With these lemmas we can now prove that with high probability, Algorithm TC takes O(lg n)
steps to contract a rooted binary tree to a single node.

Theorem  44  With  high  probability,  Algorithm  TC  takes  O(lgn)  contraction  steps  to  contract 
a  rooted  binary  tree  of  n  nodes  to  a  single  node. 

Proof: The proof has three parts. First, we use Lemma 41 to show that if a rooted binary
tree has |V| nodes, the expected number of nodes pairing with a parent in a single contraction
step is at least |V|/4. Next, we use Lemma 42 to show that the probability that at least |V|/8
nodes pair with a parent in any step is at least 1/3. Finally, we use Lemma 43 to show for any
constant k, that after $a \log_{8/7} n$ steps, for some constant a > 2, the probability that the tree
has not contracted into a single node is $O(1/n^k)$.

We first show that the expected number of nodes pairing with a parent is at least |V|/4,
provided that |V| > 1. A child is picked by its parent with probability 1 when its parent is a
degree-1 root, and 1/2 otherwise. Thus, a leaf pairs with its parent with probability at least
1/2, and a node (other than the root) with one child pairs with its parent with probability at
least 1/4. Let P be the number of nodes pairing with a parent. Then we have

\[
E(P) \ge \frac{|V_0|}{2} + \frac{|V_1|}{4} ,
\]

and applying Lemma 41 yields the desired result:

\[
E(P) \ge \frac{|V_0|}{4} + \frac{|V_1|}{4} + \frac{|V_2|}{4} + \frac{d(r)}{4} \ge \frac{|V|}{4} .
\]


Now we show that the probability that at least |V|/8 nodes pair with a parent in a single
contraction step is at least 1/3. We call such a step successful. At most half of the nodes pair
with their parents. Using Lemma 42 with b = |V|/2, w = |V|/8, and μ ≥ |V|/4, we have

\[
\Pr\{P \ge |V|/8\} \ge \frac{|V|/4 - |V|/8}{|V|/2 - |V|/8} = \frac{1}{3} .
\]


Finally, we show that with high probability, Algorithm TC takes O(lg n) contraction steps
to contract the input tree to a single node. The size of the tree after a contraction following a
successful pairing step is at most 7/8 the size before the contraction. After $\log_{8/7} n$ successful
steps, the tree must consist of a single node. By Lemma 43, the probability that fewer than
$\log_{8/7} n$ successful steps occur in $a \log_{8/7} n$ steps is

\[
B(\log_{8/7} n,\ a \log_{8/7} n,\ 1/3) \le 2\left((2/3)^{a}\, a e\right)^{\log_{8/7} n}
  = 2\, n^{\log_{8/7}\left((2/3)^{a} a e\right)} .
\]

For any value k, we can choose a large enough so that $B(\log_{8/7} n, a \log_{8/7} n, 1/3) = O(1/n^k)$.
In particular, for k = 1 a value of a = 8 suffices. □
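
The choice of a can be checked numerically. The sketch below assumes the bound has the form $2\,n^{\log_{8/7}((2/3)^a a e)}$, so that it is $O(1/n^k)$ whenever the exponent of n is at most $-k$; the function names are hypothetical, and the search starts at a = 3 because the proof requires a > 2.

```python
import math


def exponent(a):
    """Exponent of n in the assumed bound: log_{8/7}((2/3)^a * a * e)."""
    return math.log((2 / 3) ** a * a * math.e, 8 / 7)


def smallest_a(k):
    """Smallest integer a > 2 for which the exponent is at most -k."""
    a = 3
    while exponent(a) > -k:
        a += 1
    return a
```

Under this reading, `exponent(7)` is still positive while `exponent(8)` is below -1, which agrees with the statement that a = 8 suffices for k = 1.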

We now prove that Algorithm TC is communication efficient in the DRAM model.


Theorem  45  Algorithm  TC  is  conservative. 


Proof: Each node of a contracted tree is a connected subgraph of the input tree. The root of
the contraction tree that represents the subgraph is called the representative of the subgraph.
The representative and its spare are each either a node of the subgraph or a mate of a node of
the subgraph.

Every set of memory accesses performed by the algorithm is of one of two types. In the
first type, each representative of a subgraph communicates with its spare, if at all. In the
second type, each representative of a subgraph communicates with the representative of one
of its children in the contracted tree. In either of these two cases, the set of memory accesses
corresponds to a set of disjoint paths in the input graph, and hence, by the Shortcut Lemma,
is conservative with respect to the input graph. □

Tree contraction can be performed conservatively and deterministically on a DRAM with
m processors in O(lg n lg* m) steps using the deterministic coin-tossing algorithm of Cole and
Vishkin [20]. The key idea is that in Algorithm TC, the nodes that can pair form chains, and
by Lemma 41 these chains contain at least half the tree edges. The chains can be oriented from
child to parent in the tree, and deterministic coin tossing can be used to perform the pairing
step in O(lg* m) steps.


2.6  Treefix  computations

This section presents a generalization of the parallel prefix computation to binary trees. We
present two kinds of treefix computations, rootfix and leaffix, and show how they can be implemented
by an O(lg n)-step conservative algorithm that uses O(n) space, where n is the number
of nodes in the input tree. As we shall see in Section 2.7, treefix computations can greatly simplify
the description of many parallel graph algorithms in the literature, and moreover, treefix
computations can be performed by conservative algorithms.

We  begin  with  a  definition  of  treefix  computation. 

Definition. Let V be a domain with a binary associative operation • and an identity e. Let T
be a rooted, binary tree in which each vertex $i \in T$ has an assigned input value $x_i \in V$. The
rootfix problem is to compute for each vertex $i \in T$ with parent $j$, the output value $y_i = y_j \bullet x_i$,
where $y_j = e$ if $i$ is the root. The leaffix problem is to compute for each vertex $i \in T$ with left
child $j$ and right child $k$, the output value $y_i = x_i \bullet y_j \bullet y_k$, where $y_j = e$ if $i$ has no left child
and $y_k = e$ if $i$ has no right child.

Simple examples of treefix problems are computing the depth of each vertex in a rooted
binary tree and computing the size of each subtree. These and other examples appear in the
next section.
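
The two definitions can be made concrete with a direct sequential evaluation. This sketch is not the conservative DRAM algorithm of this section; it only computes the same output values so the definitions can be tried on small trees. The representation (`tree` maps each vertex to a pair `(left, right)`, with `None` for a missing child) and the function names are assumptions.

```python
def rootfix(tree, root, x, op, identity):
    """Compute y_i = y_parent(i) . x_i for every vertex, with y = e above the root."""
    y = {}
    stack = [(root, identity)]
    while stack:
        i, y_parent = stack.pop()
        y[i] = op(y_parent, x[i])
        for child in tree[i]:
            if child is not None:
                stack.append((child, y[i]))
    return y


def leaffix(tree, root, x, op, identity):
    """Compute y_i = x_i . y_left(i) . y_right(i), using e for a missing child."""
    y = {}

    def visit(i):
        if i is None:
            return identity
        left, right = tree[i]
        y_left, y_right = visit(left), visit(right)
        y[i] = op(op(x[i], y_left), y_right)
        return y[i]

    visit(root)
    return y
```

With integer addition, a rootfix in which the root has input 0 and every other vertex has input 1 gives depths, and a leaffix in which every vertex has input 1 gives subtree sizes.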

Like  the  prefix  computation  on  lists,  treefix  computations  can  be  performed  directly  on  the 
contraction  tree.  For  simplicity,  however,  we  describe  a  recursive  version. 

Theorem 46 Let T be a binary tree of n nodes on a DRAM with m processors. A rootfix or
leaffix computation can be performed on T by a conservative randomized algorithm which, with
high probability, takes O(lg n) steps, or by a conservative deterministic algorithm which takes
O(lg n lg* m) steps. Both algorithms use O(1) space per node of the tree.

Proof: Both treefix computations are performed by executing a single contraction step on the
input tree T to produce a contracted tree T'. Each node in T' is assigned an input value, and
the treefix computation is executed recursively on T'. The contracted tree T' is then expanded
to yield T once again, and the output value of each node in T is computed from the input
values of T and the output values of T'.

The algorithm for leaffix is based on each node i maintaining a value $s_i$ which has the form
$a_i \sqcup b_i \sqcup c_i$, where $a_i, b_i, c_i \in V$ are elements of the domain, and the character “$\sqcup$” represents
symbolically a slot to be filled in with a value. The number of slots is equal to the number of
children of the node, and each slot corresponds to a specific child. When a parent and child
pair during the course of the leaffix algorithm, the value of the child is substituted into the
corresponding slot in the value of its parent. For example, suppose node i pairs with its right
child j, where the value of i is $s_i = a_i \sqcup b_i \sqcup c_i$ and the value of j is $s_j = a_j \sqcup b_j$. The value $s_k$
of the merged node k is computed from $s_i$ and $s_j$ by substituting $s_j$ into the second slot in $s_i$,
yielding the value $s_k = a_i \sqcup b_i \bullet a_j \sqcup b_j \bullet c_i$. The • operations are carried out immediately so that
$s_k$ has the proper form.
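
The slot substitution can be sketched as follows. The encoding is an assumption made for illustration: a value with m slots is stored as a Python list of m + 1 domain elements, with one slot between each pair of consecutive entries, and `substitute` is a hypothetical helper name.

```python
def substitute(parent_val, slot, child_val, op):
    """Fill slot number `slot` (0-based) of parent_val with child_val.

    The elements on either side of the filled slot are combined immediately
    with the ends of the child's value, so the result again has the form
    "element, slot, element, ..., element".
    """
    left, right = parent_val[:slot + 1], parent_val[slot + 1:]
    if len(child_val) == 1:
        # The child has no slots: everything around the slot collapses to one element.
        return left[:-1] + [op(op(left[-1], child_val[0]), right[0])] + right[1:]
    return (left[:-1]
            + [op(left[-1], child_val[0])]     # e.g. b_i . a_j
            + child_val[1:-1]
            + [op(child_val[-1], right[0])]    # e.g. b_j . c_i
            + right[1:])
```

With string concatenation as the operation, substituting `["aj", "bj"]` into the second slot of `["ai", "bi", "ci"]` gives `["ai", "biaj", "bjci"]`, which mirrors the worked example in the text.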

The leaffix algorithm initializes each node i by $s_i \leftarrow x_i$, $s_i \leftarrow x_i \sqcup e$, or $s_i \leftarrow x_i \sqcup e \sqcup e$, depending
on the number of children of node i. The algorithm then proceeds as follows. At the end of
a contraction step, each node k in T' that results from the merging of parent i and child j
computes its value $s_k$ by substituting $s_j$ into the appropriate slot of $s_i$. The leaffix algorithm is
then performed recursively on T' using the s values as inputs and yielding y values as output.
(The y values contain no slots and are simply elements of the domain V.) During the expansion
step, the parent node i sets $y_i \leftarrow y_k$. Each child node j gets its output value $y_j$ by substituting
the y values of its children into the slots of $s_j$.

In the rootfix algorithm, each node i maintains a value $s_i$, as in the leaffix algorithm, but
each $s_i$ now has the general form $s_i = \sqcup\, a_i$, and the slot of a node corresponds to the node’s
parent. The rootfix algorithm initializes each node i by $s_i \leftarrow \sqcup\, x_i$, except for the root r which
performs $s_r \leftarrow x_r$. After the pairs have been determined for the contraction step, each node j
that is the child in a pair, and which itself has a child, substitutes $s_j$ in the appropriate slot
of its child’s value. At the end of a contraction step, each node k in T' that resulted from the
merging of parent i and child j computes its value by $s_k \leftarrow s_i$. The rootfix algorithm is then
performed recursively on T', yielding y values as output. During the expansion step, the parent
node i sets $y_i \leftarrow y_k$. Each child node j gets its output value $y_j$ by substituting $y_i$ into the slot
of $s_j$.

The time and space bounds claimed in the theorem are apparent by inspection. Each step
of a treefix algorithm adds only a constant amount of work to a corresponding step in the tree
contraction and expansion algorithms. The additional space required by the treefix algorithms
is the O(1) space per node for the x, y, and s values. □


2.7  Graph  algorithms 

This  section  presents  a  collection  of  conservative  DRAM  algorithms  for  solving  graph  problems. 
The algorithms use two processors per edge of an input graph G = (V, E) and require only
constant  extra  space  in  each  processor.  Most  of  the  algorithms  use  treefix  computations  as 
subroutines. 

We  represent  each  vertex  in  an  undirected  graph  G  =  (V,  E)  by  a  doubly  linked  incidence 
ring  of  processors,  one  for  each  edge.  Each  element  of  the  incidence  ring  contains  pointers  to 
the next and previous elements in the ring, and one pointer for a graph edge. For each edge
$(u, v) \in E$, the element in the incidence ring for u contains a pointer to an edge element in the
incidence ring for v, and vice versa. A directed graph is represented in the same doubly linked
fashion, but the graph edges are labeled with their directions.

We represent trees with arbitrary vertex degrees by an incidence ring structure as well. If the
tree is directed, each ring has a unique principal element that points toward the root. Breaking
the incidence ring before the principal element yields the standard binary-tree representation
of the tree [39, pp. 332-333].

We now present brief descriptions of the algorithms. The performance is given in terms of
the number of steps on a DRAM when the input representation has size n. We assume the
implicit tree contractions in the algorithms are performed by the randomized Algorithm TC.
Deterministic bounds can be obtained by multiplying the number of steps by O(lg* m), where
m is the number of processors. An upper bound on the time required in the DRAM model can
be obtained by multiplying the number of steps by the load factor of the input.

Generalized treefix. Perform a treefix operation on a directed tree with arbitrary vertex
degree. The input values $\{x_i\}$ are stored in the principal elements of the tree, which is where
the output values $\{y_i\}$ are to be placed. The leaffix value at a node i whose children have values
$y_1, y_2, \ldots, y_k$ is $y_i = x_i \bullet y_1 \bullet y_2 \bullet \cdots \bullet y_k$. Each non-principal element is assigned the identity e for
its value. A binary treefix computation performed on the binary tree representation underlying
the tree computes the desired values. Performance: O(lg n).

Tree functions. Given a directed tree, compute for each node the number of descendants,
its height, or its depth. The number of descendants for each node can be computed by a leaffix
computation with • as integer addition and $x_i = 1$ for all nodes. The height of a node can also
be computed by a leaffix computation where $a \bullet b = \max(a + 1, b + 1)$, the identity is $e = -1$,
and $x_i = -1$ for all nodes.² The depth of a node can be computed by a rootfix computation
with • as addition and $x_i = 1$ for all nodes except the root, which has value 0. Performance:
O(lg n).

²Technically, e = -1 is not an identity for the operation a • b = max(a + 1, b + 1). Nonetheless, this leaffix
computation correctly computes the height of each node in a binary tree. Moreover, this leaffix computation also
generalizes to a directed tree with arbitrary vertex degree.

Rooting an undirected tree. Pick a root of a tree with undirected graph pointers, and
orient the graph pointers toward the root. Form an “Eulerian tour” of the pointers of the
representation [92] by directing each element of the tree to link its incoming ring pointer with
its graph edge directed outward and its graph edge directed inward with its outgoing ring
pointer. Each graph edge is used twice in the tour, once in each direction, but each ring pointer
is used only once. Use Algorithm LC to form a contraction tree of the tour. Choose the root
of the contraction tree to be the root of the tree, and break the tour so that it begins with the
root. Use parallel prefix to number each node according to its first occurrence in the tour. Use
contraction trees to distribute the smallest value in each incidence ring to the elements of the
ring. Orient each graph edge from the larger value to the smaller. Performance: O(lg n).

Rerooting a directed tree. Given a directed tree and another distinguished vertex k,
reorient the graph edges of the tree to point toward k. The algorithm for rooting a tree can be
used by picking k as the root instead of the root of the contraction tree, but a single treefix
computation suffices. Perform a leaffix computation with x_k = 1 and x_i = 0 for i ≠ k, and use
Boolean OR for •. Each principal element whose leaffix value is 1 lies on the path from k to
the root. Reverse the direction of the graph pointers of these elements. (Note: rerooting a tree
changes the principal elements.) Performance: O(lg n).
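
The rerooting rule can be illustrated sequentially. The sketch below assumes the tree is given as a parent map in which the root points to itself (a convention chosen here, not the incidence-ring representation of the text). It evaluates the Boolean-OR leaffix directly and then reverses the pointers of the marked nodes; `reroot` is a hypothetical helper name.

```python
def reroot(parent, k):
    """Return a new parent map for the same tree, rooted at k.

    parent[v] is the parent of v; the root satisfies parent[root] == root.
    """
    root = next(v for v, p in parent.items() if p == v)
    children = {v: [] for v in parent}
    for v, p in parent.items():
        if v != root:
            children[p].append(v)

    # Visit parents before children, then sweep in reverse for the leaffix.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    # Leaffix with OR, x_k = 1 and x_i = 0 otherwise:
    # on_path[v] is True exactly when k lies in the subtree of v.
    on_path = {}
    for v in reversed(order):
        on_path[v] = (v == k) or any(on_path[c] for c in children[v])

    # Reverse the pointer of every marked node other than the old root.
    new_parent = dict(parent)
    new_parent[k] = k
    for v in parent:
        if on_path[v] and v != root:
            new_parent[parent[v]] = v
    return new_parent


old = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 4}
new = reroot(old, 4)
```

Only the pointers on the path from k to the old root change; every other node keeps its parent.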

Tree-walk numberings of a binary tree. Number the nodes of a binary tree according
to the order in which they would be visited in a preorder/inorder/postorder tree walk. For each
of the walks, we compute y_k, the number of nodes visited before the left subtree of k. Use a leaffix
computation to compute the size of the subtree rooted at k. We first compute the
preorder numbering. (For the purposes of these numbering algorithms, we consider the root to
be a left child.) If node k is a left child, set x_k ← 1. If node k is a right child, set x_k to the
size of its sibling subtree plus 1. A rootfix computation with + yields y_k, which is the preorder
numbering of node k. The inorder numbering can be computed similarly. If node k is a left
child, set x_k ← 0. If k is a right child, set x_k to the size of its sibling subtree plus 1. Compute y_k
for each node using a rootfix computation with +. The inorder numbering of node k is y_k plus
the size of its left subtree plus 1. The postfix numbering can be computed by setting x_k ← 0
if node k is a left child, and by setting x_k to the size of its sibling subtree if k is a right child.
After computing y_k using a rootfix computation with +, the postfix numbering of node k is y_k
plus the sizes of its two subtrees plus 1. Performance: O(lg n).
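
The three sets of x values can be sanity-checked with a sequential sketch. It assumes the binary tree is given by two dictionaries `left` and `right` (illustrative names) and accumulates the three rootfix sums in a single top-down pass. A missing sibling is treated as a subtree of size 0.

```python
def walk_numberings(left, right, root):
    """Return (preorder, inorder, postorder) numberings, each starting at 1."""
    # Parents-before-children order, used forward (rootfix) and reversed (leaffix).
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for c in (left.get(v), right.get(v)):
            if c is not None:
                stack.append(c)

    # Leaffix with + and x = 1: subtree sizes.
    size = {}
    for v in reversed(order):
        size[v] = 1 + size.get(left.get(v), 0) + size.get(right.get(v), 0)

    # Rootfix sums; the root is treated as a left child (x = 1, 0, 0).
    y = {root: (1, 0, 0)}
    pre, ino, post = {}, {}, {}
    for v in order:
        l, r = left.get(v), right.get(v)
        sl, sr = size.get(l, 0), size.get(r, 0)
        yp, yi, yo = y[v]
        pre[v] = yp
        ino[v] = yi + sl + 1
        post[v] = yo + sl + sr + 1
        if l is not None:   # left child: x = 1, 0, 0
            y[l] = (yp + 1, yi, yo)
        if r is not None:   # right child: x = sl + 1, sl + 1, sl
            y[r] = (yp + sl + 1, yi + sl + 1, yo + sl)
    return pre, ino, post


left = {"a": "b", "b": "d"}
right = {"a": "c", "b": "e", "c": "f"}
pre, ino, post = walk_numberings(left, right, "a")
```

For this sample tree the preorder walk is a, b, d, e, c, f, the inorder walk is d, b, e, a, c, f, and the postorder walk is d, e, b, f, c, a.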

Prefix and postfix numberings of a directed tree. Number the edges of an arbitrary
directed tree according to the order they are visited in a preorder/postorder tree walk. The
problem reduces to prefix/postfix numbering on the underlying binary tree representation.
Performance: O(lg n).

Diameter and center of a tree. The diameter is the length of the longest path in the tree.
A center is a vertex v such that the longest path from v to a leaf is minimal over all vertices
in the tree. The diameter can be determined by rooting the tree and using rootfix to find the
farthest leaf from the root. Reroot the tree at this leaf. The distance from the new root to
the farthest leaf is the diameter. (This algorithm is based on an analog algorithm attributed
to J. Wennmacker [23].) A center of the tree can be determined by finding a median element
of the path that realizes the diameter. Performance: O(lg n).
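
The two-sweep idea can be sketched sequentially. The code below substitutes breadth-first search for the rootfix computations, so it illustrates only the logic, not the O(lg n) DRAM implementation. The adjacency-list input and the function names are assumptions made for the example.

```python
from collections import deque


def farthest(adj, source):
    """Breadth-first search; return (farthest vertex, distances, BFS parents)."""
    dist, par = {source: 0}, {source: None}
    queue, last = deque([source]), source
    while queue:
        v = queue.popleft()
        last = v  # the last vertex dequeued is at maximum distance
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                par[w] = v
                queue.append(w)
    return last, dist, par


def diameter_and_center(adj, root):
    """Return (diameter in edges, a center vertex) of an undirected tree."""
    u, _, _ = farthest(adj, root)   # first sweep: farthest leaf from the root
    v, dist, par = farthest(adj, u)  # second sweep, "rerooted" at that leaf
    path = [v]
    while par[path[-1]] is not None:
        path.append(par[path[-1]])
    return dist[v], path[len(path) // 2]  # median element of the longest path


# Path 0-1-2-3-4 with an extra leaf 5 attached to vertex 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
diameter, center = diameter_and_center(adj, root=5)
```

When the diameter is odd the tree has two centers, and the median rule returns one of them.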

Centroid  of  a  tree.  A  centroid  is  a  vertex  v  such  that  the  largest  subtree  with  v  as  a  leaf 
is  minimal  over  all  vertices  in  the  tree.  A  centroid  can  be  determined  by  rooting  the  tree  and 
computing  the  size  of  each  subtree.  By  broadcasting  the  size  m  of  the  tree  from  the  root,  each 
graph  edge  in  each  incidence  ring  can  determine  the  number  of  elements  on  the  other  side  of 
the  edge.  For  each  incidence  ring,  compute  the  maximum  of  these  values.  A  vertex  with  the 
minimum  of  these  maximum  values  is  a  centroid.  Performance:  O(lgn). 
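
A sequential sketch of the centroid rule follows. It assumes an adjacency-list tree and uses an explicit traversal in place of the leaffix and broadcast steps. For each vertex it takes the maximum number of vertices lying on the far side of any incident edge, then picks a vertex minimizing that maximum. `centroid` is an illustrative name.

```python
def centroid(adj, root):
    """Return a centroid of the undirected tree given by adjacency lists."""
    parent = {root: None}
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                stack.append(w)

    # Subtree sizes, children before parents.
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[w] for w in adj[v] if w != parent[v])
    m = size[root]  # total number of vertices, "broadcast" from the root

    def worst(v):
        # Vertices on the other side of each edge incident on v.
        sides = [size[w] if w != parent[v] else m - size[v] for w in adj[v]]
        return max(sides, default=0)

    return min(order, key=worst)


adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
c = centroid(adj, root=0)
```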

Separator of a tree. A separator [62] is a partition of the vertices of an n-vertex tree into
three sets A, B, and C, with |A| ≤ (2/3)n, |B| = 1, and |C| ≤ (2/3)n, such that no edge of the tree
goes between a vertex in A and a vertex in C. Determine a centroid of the tree. This vertex
is the separator vertex in B. It remains to partition the remaining vertices between A and C.
For each graph edge in the incidence ring, count the number of vertices in the subtree on the
other side of the edge. Put the largest subtree in A. Use parallel prefix on the incidence ring to
compute a running sum of the sizes of the other subtrees. Put all subtrees whose prefix value
is  at  most  \n  in  C,  and  put  the  remainder  in  A.  Performance:  C(lgn). 

Subexpression evaluation. Given a directed tree in which each leaf has a value and each
internal node has an operator from {+, −, ×, ÷}, compute for each internal node the
subexpression rooted at that node. A single leaffix-like computation suffices using the ideas of
Brent [15] and Miller and Reif [68]. Performance: O(lg n).

Minimum-cost spanning forest. A spanning forest of an undirected graph G = (V, E) is
a maximal set F ⊆ E of edges that contains no cycles. Given an undirected graph G = (V, E)
and a cost function w : E → R, determine a spanning forest F such that the sum of the costs
(weights) of the edges in F is minimum. We give a conservative DRAM implementation of
Boruvka's algorithm, also attributed to Sollin [91, pp. 71-83]. We assume without loss of
generality that the edge weights are distinct; otherwise, we can assign the weight of a graph edge
e between two incidence-ring elements with addresses a and b to be (w(e), max(a, b), min(a, b))
and then compare weights lexicographically. We determine F by marking edges in G. Initially,
no edges are marked. At each step of the algorithm, the currently marked graph edges form
a subforest of F. Break each incidence ring by removing a single ring pointer, and direct the
resulting linear list. At each step of the algorithm, the marked graph edges and the ring pointers
form a set {T_i} of rooted trees, where the index i of the tree is the address of the root. The
algorithm proceeds as follows. For each tree T_i, use a rootfix computation to broadcast i to all
of the elements in T_i. Use a leaffix computation on T_i to determine an edge e ∈ E connecting an
edge element u ∈ T_i with an edge element v ∈ T_j, where i ≠ j, with smallest weight. If no such
edge exists, the algorithm terminates. If T_j picks the same edge as T_i, the tree with smaller
index does nothing. Otherwise, mark e as a member of F, directing it into T_j, and reroot T_i
with u as the new root. Repeat this procedure until the algorithm terminates. Performance:
O(lg² n).
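
The round structure of Boruvka's algorithm, together with the lexicographic tie-breaking rule, can be sketched sequentially. This is not the conservative DRAM implementation: it replaces the rooted trees {T_i} with a union-find structure, and it uses vertex numbers where the text uses incidence-ring addresses. Both are assumptions made to keep the example short.

```python
def boruvka(n, edges):
    """Return the indices of the edges in a minimum-cost spanning forest.

    n     -- number of vertices, numbered 0 .. n-1
    edges -- list of (weight, a, b) triples
    """
    comp = list(range(n))  # union-find parent pointers

    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]  # path halving
            v = comp[v]
        return v

    def key(i):
        w, a, b = edges[i]
        return (w, max(a, b), min(a, b))  # distinct keys, compared lexicographically

    forest = set()
    while True:
        # Each component picks its smallest-key edge leaving the component.
        best = {}
        for i, (_, a, b) in enumerate(edges):
            ra, rb = find(a), find(b)
            if ra == rb:
                continue
            for r in (ra, rb):
                if r not in best or key(i) < key(best[r]):
                    best[r] = i
        if not best:  # no component has a neighbor: done
            return forest
        for i in best.values():
            _, a, b = edges[i]
            ra, rb = find(a), find(b)
            if ra != rb:  # skip an edge already used by the other endpoint
                comp[ra] = rb
                forest.add(i)


# Vertex 4 is isolated; edges 1 and 2 tie on weight and are separated by the key.
edges = [(1, 0, 1), (2, 1, 2), (2, 0, 2), (3, 2, 3)]
forest = boruvka(5, edges)
```

Because every component with a neighbor merges in each round, the number of such components at least halves per round, which gives the lg n factor for the number of rounds.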

Connected components. Given an undirected input graph G = (V, E), determine a
labeling l : V → Z such that l(v) = l(v') if and only if v and v' are in the same connected
component of G. The algorithm is the same as the minimum spanning tree algorithm, choosing
the weight of a graph edge e between incidence ring elements with addresses a and b to be
(max(a, b), min(a, b)). The label of a vertex is the index of its tree. Performance: O(lg² n).

Biconnected components. Two edges of an undirected graph G = (V, E) are in the same
biconnected component if they lie on a common simple cycle. Determine a labeling l : E → Z
such that l(e) = l(e') if and only if e and e' are in the same biconnected component of G. We give
a conservative DRAM implementation of the biconnectivity algorithm of Tarjan and Vishkin
[92]. We assume that the reader has some familiarity with that algorithm. Find a (directed)
minimum spanning tree T = (V, F) of G. Number the vertices in the minimum spanning tree
in preorder. Use leaffix computations to compute for each vertex v three values: nd(v), low(v),
and high(v). The value nd(v) is the number of descendants of v, and low(v) (high(v)) is the
lowest (highest) numbered vertex (with respect to the preorder numbering of T) that is either
a descendant of v or adjacent to a descendant of v by an edge of E − F. Build a new graph G'
where the edges of F are the vertices of G'. Let e be an edge from v to p(v), where p(v) is the
parent of v in F. The adjacency ring for v in G acts as the adjacency ring for e in G'. Add
two kinds of edges to G'. For each edge {w, v} in E − F such that v + nd(v) ≤ w, add an edge
{{v, p(v)}, {w, p(w)}} to G'. For each edge (v, p(v)) ∈ F such that v ≠ 1 and p(v) ≠ 1, and
low(v) < p(v) or high(v) ≥ p(v) + nd(p(v)), add an edge {{v, p(v)}, {p(v), p(p(v))}} to G'. It can
be verified that the representation of G' is conservative with respect to the representation of
G. Find the connected components of G'. Two edges of F are in the same block if as vertices
in G' they are in the same connected component. Finally, for each edge e = {w, v} in E − F,
let l(e) = l({w, p(w)}). Performance: O(lg² n).

Eulerian cycle. An Eulerian cycle of an undirected graph G = (V, E) is a cycle containing
each edge in E exactly once. If any vertex has odd degree, then no Eulerian cycle exists. Form a
set of disjoint cycles of the pointers of the representation of G as in the algorithm for directing a
tree. The cycles can be merged using an algorithm similar to the minimum-cost-spanning-forest
algorithm [5, 7]. Performance: O(lg² n).

2.8  Concurrent  reads  and  writes 

This section explores the use of concurrent reads and writes to memory in a DRAM. When
concurrent reads and writes are allowed, the definition of load must be modified so that the load
factor remains a lower bound on the time to deliver a set of messages. With the new definition
comes a new shortcut lemma. The shortcut lemma makes it possible to perform pointer-jumping
techniques similar to recursive doubling in a conservative fashion. As a consequence,
the minimum-cost spanning forest, connected components, and biconnected components
problems can be solved in O(lg n) steps by conservative algorithms. These algorithms are faster
than the corresponding exclusive-read exclusive-write algorithms from the preceding section by
a factor of lg n.

A concurrent read or write occurs when two or more processors attempt to read or write
the same memory location in a single memory access step. We shall assume that when several
processors make requests to read the contents of a location, all of the requests are satisfied. The
situation is more complicated when several processors attempt to write to the same location.
We shall assume that there is some simple rule for combining multiple write requests to the
same location. For example, one of the requests may be arbitrarily chosen to succeed while the
others are denied, or the sum of the requests may be written into the location.
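
As a small illustration of combining rules, the sketch below resolves one step of write requests under a caller-supplied rule. The function `combine_writes` and the request format are assumptions made for the example; the text fixes no particular interface.

```python
import operator


def combine_writes(requests, rule):
    """Resolve one memory-access step of write requests.

    requests -- iterable of (address, value) pairs issued in the same step
    rule     -- binary function combining two values aimed at one address
    """
    memory = {}
    for address, value in requests:
        if address in memory:
            memory[address] = rule(memory[address], value)
        else:
            memory[address] = value
    return memory


requests = [(7, 1), (3, 5), (7, 2), (7, 3)]

summed = combine_writes(requests, operator.add)               # sum of the requests
smallest = combine_writes(requests, min)                      # minimum wins
arbitrary = combine_writes(requests, lambda kept, new: kept)  # one request succeeds
```

The `min` rule is the one used later for star hooking, where the lowest edge cost written to a root wins.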

2.8.1  A  new  definition  of  load 

A new measure of load is needed to model the implementation of concurrent reads and writes
by an underlying routing network. When several processors request to read a location, it is only
necessary for one copy of the data in that location to cross any cut of the underlying network.
Similarly, since multiple writes can be combined, at most one message carrying the data to be
written into any particular destination needs to cross any cut. The old definition of the load
of a set of messages M on a cut S was the number of messages in M whose source is in S and
whose destination is in S̄, or vice versa. This measure overestimates the number of messages
that must cross the cut when some of the messages have the same destination in S̄, and can be
combined. Consequently, with this measure of load, the load factor is not necessarily a lower
bound on the time to deliver a set of messages. The new definition of the load of M on a cut
S is the number of different destinations in S̄ of messages originating in S, or vice versa. The
definitions of a cut, the capacity of a cut, the load factor, and a conservative algorithm remain
the same. With the new measure of load, the load factor is a lower bound on the time required
to deliver the messages.
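
The difference between the two measures is easy to see on a small example. The sketch below represents a message as a (source, destination) pair and a cut as the set S of processors on one side. It reads "or vice versa" as counting crossings in both directions, which is an interpretation rather than something the text spells out.

```python
def old_load(messages, S):
    """Number of messages with exactly one endpoint in S."""
    return sum(1 for src, dst in messages if (src in S) != (dst in S))


def new_load(messages, S):
    """Number of distinct destinations of messages that cross the cut."""
    return len({dst for src, dst in messages if (src in S) != (dst in S)})


S = {0, 1, 2}
# Three processors in S address the same location outside S; one message comes back.
messages = [(0, 5), (1, 5), (2, 5), (6, 0), (1, 2)]

before = old_load(messages, S)
after = new_load(messages, S)
```

The three messages destined for processor 5 count once under the new measure because they can be combined before crossing the cut.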

The change in the definition of load raises the hope that standard PRAM techniques such as
recursive doubling are conservative after all. However, returning to the example of Figure 2-1,
we see that after the fourth step, each of the first eight elements in the list points to a different
element on the other side of the cut. Thus, even with the new definition, the load on the cut
has increased from one to eight in three steps. Nevertheless, we will show that a slightly more
sophisticated pointer-jumping strategy is conservative.

2.8.2  A  shortcut  lemma  for  concurrent  reads  and  writes 

The following lemma shows that if all of the pointers into a particular processor are
simultaneously shortcut, then the load factor does not increase. Note that unlike the original Shortcut
Lemma, the pointer b → c is not removed from P.

Lemma 47 Suppose a data structure consists of a set P of pointers on a set V of vertices and
that P contains a pointer b → c. Let R = {x → b : x ∈ V, x → b ∈ P} be the set of pointers in
P into b. Then the set Q defined by

Q = P ∪ {x → c : x ∈ V, x → b ∈ R} − R

is conservative with respect to P.

Proof: We will show that load(Q, S) ≤ load(P, S) for any cut S of the DRAM. There are four
ways in which b and c can be assigned to the partition induced by a cut S. Two of the cases
can be eliminated by symmetry if we assume that b is on the left side. In both of the remaining
cases, the load across the cut is either unchanged or diminished when all of the pointers of the
form x → b are replaced by pointers x → c, as shown in Figure 2-6. Note that if b and c lie on
the left side of the cut, then all of the pointers into b from the right side of the cut must be
shortcut, or the load may increase. □
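
The lemma can be checked by brute force on a small instance. The sketch below treats each pointer as a message from its tail to its head, uses the new definition of load, and enumerates every cut. The pointer set is an arbitrary example, and `shortcut` and `never_heavier` are hypothetical helper names.

```python
from itertools import combinations


def load(pointers, S):
    """New-definition load: distinct pointer targets across the cut S."""
    return len({y for x, y in pointers if (x in S) != (y in S)})


def shortcut(P, b, c):
    """Replace every pointer x -> b by x -> c, keeping b -> c itself."""
    assert (b, c) in P
    R = {(x, y) for (x, y) in P if y == b}
    return (P | {(x, c) for (x, _) in R}) - R


def never_heavier(Q, P, V):
    """True if load(Q, S) <= load(P, S) for every cut S of V."""
    return all(
        load(Q, set(S)) <= load(P, set(S))
        for r in range(len(V) + 1)
        for S in combinations(V, r)
    )


V = list(range(6))
P = {(0, 3), (1, 3), (2, 3), (3, 4), (4, 5), (5, 5)}

Q = shortcut(P, 3, 4)             # all pointers into 3 shortcut at once
partial = (P - {(0, 3)}) | {(0, 4)}  # only one of them shortcut
```

The `partial` set shows why the shortcut must be simultaneous: on the cut S = {3, 4} the pointers 1 → 3 and 2 → 3 still target 3 while 0 → 4 now targets 4, so the load rises from 2 to 3.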

Corollary 48 Let B be a set of nodes in V that are independent with respect to P. For each
y ∈ B let y → c(y) be a pointer out of y. Let R = {x → y : x ∈ V, y ∈ B, x → y ∈ P} be the set
of pointers into the nodes of B. Then the set Q of pointers defined by

Q = P ∪ {x → c(y) : x, y ∈ V, x → y ∈ R} − R

is conservative with respect to P.

Proof: The proof is by induction on the number of nodes in B. □

2.8.3  A  conservative  pointer  jumping  technique 

The corollary suggests the following conservative tree contraction technique: select a set of
independent internal (non-leaf) nodes, then shortcut all of the pointers into those nodes. When
the pointers into a node (excluding the root) are shortcut, the node becomes a leaf. Thus,
the shortcutting step can reduce, but not increase, the number of internal nodes. The step is
repeated until every node in the tree (including the root) points to the root. Such a tree is
called a star. Note that unlike the tree contraction algorithm from Section 2.5, the number of
nodes in the tree does not decrease at each step, and the in-degree of the nodes in the tree can
grow.

Figure 2-6: A shortcut lemma for concurrent reads and writes. In each of the two cases
illustrated, the load factor across the cut is either unchanged or diminished by replacing all of
the pointers of the form x → b with pointers of the form x → c.

It is relatively easy to find a large random independent set of internal nodes. First, each
internal node chooses to be a candidate with probability 1/2. Next, all candidates whose
parents have also initially chosen to be candidates drop out of the running. The remaining
candidates form an independent set. At each step, every internal node except the root has
probability 1/4 of belonging to the set. Since the root points to itself, it will never be included.
By Lemma 42 the probability that at least 1/8 of the internal nodes (excluding the root) belong
to the independent set, and thus become leaves, is at least 1/7.

The following lemma shows that if the independent sets are found this way, then the
algorithm requires O(lg n) steps, with high probability.

Lemma 49 With high probability, the randomized pointer jumping algorithm takes O(lg n)
steps to contract an n-node directed tree to a star.

Proof:  The  proof  is  nearly  identical  to  the  third  part  of  the  proof  of  Theorem  44.  □ 
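
The randomized contraction can be simulated sequentially. The sketch below assumes a parent-map representation in which the root points to itself, and it performs each shortcutting step as one simultaneous update. The function name and the chain example are illustrative, and the step count it returns depends on the random choices.

```python
import random


def contract_to_star(parent, rng):
    """Randomized conservative pointer jumping on a rooted tree.

    parent -- map from each node to its parent; the root points to itself
    rng    -- a random.Random instance supplying the coin flips
    Returns (final parent map, number of steps taken).
    """
    parent = dict(parent)
    root = next(v for v, p in parent.items() if p == v)
    steps = 0
    while any(p != root for p in parent.values()):
        # Internal nodes are those that some other node points to.
        internal = {p for v, p in parent.items() if p != v} | {root}
        # Each internal node becomes a candidate with probability 1/2 ...
        candidates = {v for v in internal if rng.random() < 0.5}
        # ... and drops out if its parent is also a candidate. The root is
        # its own parent, so it always drops out.
        chosen = {v for v in candidates if parent[v] not in candidates}
        # Shortcut every pointer into a chosen node, all at once.
        parent = {
            x: (parent[y] if y in chosen else y) for x, y in parent.items()
        }
        steps += 1
    return parent, steps


chain = {i: max(i - 1, 0) for i in range(64)}  # 63 -> 62 -> ... -> 1 -> 0 -> 0
star, steps = contract_to_star(chain, random.Random(1))
```

Every pointer into a chosen node is redirected in the same step, so each step satisfies the hypothesis of Corollary 48.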

This conservative tree contraction technique can be applied when the input graph has the
doubly-linked incidence ring representation from Section 2.7. The representation of a directed
tree is itself a binary tree. After applying the tree contraction algorithm to the binary tree, all
of the elements in the representation hold pointers to the principal element on the incidence
ring of the root. Because an undirected tree can be rooted at any node, and any element on the
incidence ring of the root can be chosen as its principal element, a star rooted at any element
in the representation of an undirected tree is conservative with respect to the representation.

2.8.4  A  minimum-cost  spanning  forest  algorithm

In this section we present an O(lg n)-step conservative algorithm for finding a minimum-cost
spanning forest of an n-node graph. The algorithm is based on the CRCW PRAM minimum-cost
spanning forest algorithm of Awerbuch and Shiloach [8], which in turn is based on the
connected components algorithm of Shiloach and Vishkin [88].

A minimum-cost spanning forest is defined in Section 2.7. As in that section, we assume
without loss of generality that all edge weights are distinct, so that an input graph G = (V, E)
has a unique minimum-cost spanning forest F.

The algorithm demarcates the minimum-cost spanning forest by marking edges as belonging
to F. Initially no edges are marked. At each step of the algorithm, the currently marked edges
form a subforest of F. Each connected component of the subforest is a tree. As in the algorithm
from Section 2.7, for each of these components, the algorithm maintains a separate directed
tree on the processors in the adjacency-ring representation of the component. However, unlike
that algorithm, the edges in the directed tree are not necessarily a subset of the ring and edge
pointers. As we shall see, each directed tree is nevertheless conservative with respect to the
adjacency-ring input representation of the corresponding component. We denote the set of
directed trees {T_i}, where the index i of the tree is the address of the root. Initially, each node
in G is an isolated component, and its directed tree is a star on the nodes of its adjacency ring.
When the algorithm terminates, each directed tree is a star on the nodes in the adjacency-ring
representation of a different connected component of F.

The algorithm proceeds in phases, each consisting of two basic steps: star-hooking and
pointer-jumping. In the star-hooking step, the lowest cost edge connecting each star in {T_i}
to another component is marked as belonging to F, and the root of the star is made a child
of a node in the neighboring component. The pointer-jumping step is the same as that in the
tree-contraction algorithm. The algorithm repeats these steps until {T_i} consists entirely of
stars, and none of these stars have any neighbors in G.

We now describe the star-hooking step in more detail. The first task is to determine which
component (if any) is adjacent to each star via the lowest-cost edge. Each processor in a
star whose edge pointer leads outside the star writes the cost of the edge to the root. These
concurrent writes are combined using the min operator, so that the lowest edge cost wins. If
the star has no neighbors, then it becomes inactive. Also, if two stars select the same edge,
then the one with the lower index does nothing. Before the star is hooked into another tree,
it is rerooted at the winning processor. The new root is hooked into the neighboring tree via
its edge pointer. If the node at the end of the edge pointer is a leaf, then the edge pointer is
shortcut so that the root points to the parent of the leaf. This last operation ensures both that
the star-hooking step is conservative and that it does not increase the number of internal nodes
in {T_i}.

The following pair of theorems show that the algorithm is conservative and that it requires
O(lg n) phases, with high probability.

Theorem 50 With high probability, the algorithm requires O(lg n) phases to find the
minimum-cost spanning forest of an n-node graph.

Proof: We bound the number of phases using a potential function argument. The quantity
of interest is the number of internal nodes in active trees in {T_i}. Initially, there is a star of
height 1 for each of the n nodes in G, so there are n internal nodes. The star-hooking step
does not increase the number of internal nodes. After star hooking there are no active stars
remaining, so every tree has height at least 2. Since roots are not included in the independent
set, in the worst case we expect 1/8 of the internal nodes to be placed in the set. By Lemma 42
the probability that at least 1/16 of the internal nodes become leaves is at least 1/15. The
remainder of the proof is like the third part of the proof of Theorem 44. □

Theorem  51  The  algorithm  is  conservative. 

Proof: The key to the proof is that at the beginning of each phase, the set of directed trees,
{T_i}, is conservative with respect to the adjacency-ring representation of the input graph.
The proof is by induction on the number of phases completed. Before the first phase, {T_i}
consists of a set of stars, one for each node in the input graph. Each star is conservative
with respect to the ring pointers in its adjacency ring. Now assume the inductive hypothesis.
The star-hooking step consists of rerooting some stars, and hooking them into adjacent trees.
Rerooting is justified because, as we have previously observed, a star rooted at any node in
the adjacency-ring representation of the corresponding component is conservative with respect
to that representation. When hooking a root into a node in an adjacent tree via an edge
pointer, we must ensure that the edge pointer is shortcut in the same way that any other
pointers into that node have been shortcut. If the node is a leaf, then it may have belonged
to the independent set in some previous pointer-jumping step. In this case, the root must
be hooked into the node's parent. If the node is not a leaf then the pointers into the node
have never been shortcut. In this case, the root must be hooked into the node via its edge
pointer. In the pointer-jumping step, the pointers into an independent set of the nodes in {T_i}
are shortcut. By Corollary 48 the resulting set of trees remains conservative with respect to the
input representation.

All communication in the algorithm is performed across edge pointers and directed tree
pointers. The edge pointers are a subset of the pointers in the input representation, and as we
have just proved, the tree pointers are conservative with respect to the representation. □

The algorithm can be used as a subroutine in O(lg n)-step algorithms for finding the connected
and biconnected components of an n-node graph. The details of the reductions are presented
in Section 2.7.

The algorithm can be made deterministic using the deterministic coin-tossing algorithm of
Cole and Vishkin [20]. The goal is to find a large independent set of non-root internal nodes
without using randomization. Let k be the number of internal nodes. The first step is to
remove the leaves of {T_i} so that k nodes remain. Next, remove the roots. Since every tree has
height at least 1 after the removal of the leaves, at least k/2 nodes are left. Next, remove any
remaining nodes with 2 or more children. Since there are k/2 edges (including the self-pointers
at the roots), this step removes at most k/4 nodes. At this point the graph consists only of
chains. The deterministic coin-tossing technique can be used to select an independent
set of at least k/12 nodes in O(lg* m) steps, where m is the number of processors in the DRAM.
Thus, the time for the algorithm is O(lg n lg* m).

2.9  Remarks

This section offers a perspective on the DRAM model. We explore the analogy between PRAM's
and universal networks on the one hand, and DRAM's and volume-universal networks on
the other. We then discuss the issue of how data structures can be efficiently embedded in
DRAM's, a problem not faced in the PRAM model. We also suggest how one might define
the load factor for data structures other than graphs, such as matrices. Finally, we offer some
comments on how some of our definitions and techniques might be extended or generalized.

The literature contains a large body of results concerning universal networks, such as the
Boolean hypercube [96]. Universal networks are capable of simulating any PRAM program
with at most polylogarithmic degradation in time (see, for example, the simulation [35] of an
EREW-PRAM on a butterfly network). In light of this work, one might wonder why the DRAM
model should be studied at all.

A potential problem with universal networks is that they may be difficult to physically
construct on a large scale. The number of external connections (pins) on a packaging unit (chip,
board, rack, cabinet) of an electronic system is typically much smaller than the number of
components that the packaging unit contains, and can be made larger only with great cost. When
a network is physically constructed, each packaging unit contains a subset of the processors of
the network, and thus determines a cut of the network. For a universal network, the capacity
of every cut must be nearly as large as the number of processors on the smaller side of the cut;
otherwise, the load-factor lower bound would make it impossible to perform certain memory
accesses in polylogarithmic time. Thus, when a universal network is physically constructed,
the number of pins on a packaging unit must be nearly as large as the number of processors
in the unit. Consequently, if all the pin constraints are met, a packaging unit cannot contain
as many processors as might otherwise fit. Alternatively, if each packaging unit contains its
full complement of processors, then pin limitations preclude the universal network from being
assembled.

The impact of pin constraints can be modeled theoretically in the three-dimensional VLSI
model [51, 56] where hardware cost is measured by volume and the pinboundedness of a region
is measured by its surface area. In this model, the largest universal network that can fit in a
given volume V has only about V^(2/3) nodes. In the two-dimensional VLSI model [93], where
pinboundedness is measured by perimeter, the bound is even worse.

Since  the  density  of  processors  in  a  physical  implementation  of  a  universal  network  is  low, 
it is natural to wonder whether there are other networks that make more efficient use of hardware.
Recently, it has been shown that fat-trees [29, 56] are such a class of “volume-universal”
networks.  A  fat-tree  of  volume  V  can  simulate  any  other  network  of  comparable  volume  with 
only  polylogarithmic  degradation  in  time.  (Figure  2-7  shows  an  area-universal  fat-tree.)  Thus, 
a  fat-tree  of  volume  V  can  efficiently  simulate  not  only  the  universal  networks  with  the  same 
volume,  but  also  some  networks  with  almost  V  nodes.  A  key  component  in  the  proof  that 
fat-trees are volume-universal is an algorithm for routing a set of messages on a fat-tree in time
that  is  at  most  a  polylogarithmic  factor  larger  than  the  load  factor. 

With  a  suitable  assignment  of  capacities  to  cuts,  the  DRAM  can  abstract  the  essential 
communication  characteristics  of  volume  and  area-universal  networks  without  relying  in  detail 
on any particular network. Much as the PRAM can be viewed as an abstraction of a hypercube,
in that algorithms for a PRAM can be implemented on a hypercube with only polylogarithmic
performance  degradation,  the  DRAM  can  be  viewed  as  an  abstraction  of  a  volume  or  area- 
universal  network.  Fast,  communication-efficient  algorithms  on  a  DRAM  with  the  appropriate 
cut  capacities  translate  directly  to  fast,  communication-efficient  algorithms  on,  for  example,  a 
fat-tree. 

We now turn to the problem of embedding data structures in DRAM’s, a problem that
must be faced by users of conservative algorithms if the algorithms are to run quickly. In
general,  the  problem  of  determining  the  best  embedding  for  an  arbitrary  data  structure  is  NP- 
complete,  but  for  many  common  situations,  good  embeddings  can  be  found.  Moreover,  there 
are many situations in which the input graph structure is simple and known a priori, and a
good  embedding  may  be  easy  to  construct. 

To  illustrate  how  the  embedding  problem  can  be  solved  in  certain  practical  situations, 
consider the class of DRAM’s whose cut capacities correspond to area-universal fat-trees. For
this class of DRAM’s, the recursive structure of the underlying fat-tree network suggests that a
divide-and-conquer approach be taken. For example, a subproblem in switch-level simulation of
a VLSI circuit is the finding of electrically equivalent portions of the circuit. A naive
divide-and-conquer embedding of the circuit on the fat-tree would yield small load factors for every cut.
Thus, our conservative connected-components algorithm would never cause undue congestion
in communicating messages in the underlying network, and the algorithm would run as fast as
on an expensive, high-bandwidth network.

Figure 2-7: A fat-tree network. An area-universal fat-tree, like the one shown, is capable of
efficiently simulating any other network of comparable area. Fat-trees are well modeled by
distributed random-access machines.

For  some  graphs,  it  can  be  proved  that  divide  and  conquer  yields  near-optimal  embeddings 
on a fat-tree. Specifically, graphs for which a good separator theorem [62] exists can be embedded
well. Examples include meshes, trees, planar graphs, and multigrids. Situations in which a
mesh  might  be  used  include  systolic  array  computation  [44,  55]  and  image  processing.  Planar 
graphs  and  multigrids  arise  from  the  solution  of  sparse  linear  systems  of  equations  based  on 
the  finite-element  method.  Consequently,  conservative  DRAM  algorithms  operating  on  good 
embeddings  of  these  graphs  would  run  fast  on  a  fat-tree. 

The  algorithms  presented  in  this  chapter  operate  primarily  on  graphs  for  which  there  is  a 
natural  definition  of  load  factor.  It  is  also  possible  to  define  the  load  factor  of  a  data  structure 
that  contains  no  explicit  pointers.  For  example,  it  is  natural  to  superimpose  a  mesh  on  the 
matrix, as is suitable for systolic array computation [44, 55], and the load factor of the matrix
can  be  defined  as  the  load  factor  of  the  mesh. 

For  some  algorithms,  the  running  time  may  be  better  characterized  as  a  function  of  the  load 
factor  of  the  output  than  the  load  factor  of  the  input.  As  an  example,  consider  the  problem 
of sorting a linear list of elements. A natural question to ask is whether a list can be sorted
in  a  polylogarithmic  number  of  steps  where  at  each  step,  the  load  factor  is  bounded  by  the 
load  factor  induced  by  the  linear  list  together  with  the  permutation  determined  by  the  sorted 
output.  Such  a  sorting  algorithm  is  known  for  fat-trees  [36],  but  whether  such  an  algorithm 
exists  for  general  DRAM's  is  an  open  question. 

Whereas  the  Shortcut  Lemma  presented  in  this  chapter  holds  for  any  network,  for  particular 
networks,  other  shortcut  lemmas  may  hold.  For  example,  another  shortcut  lemma  for  fat-trees 
is  used  in  [64]  to  show  that  an  optimal  reordering  of  a  linear  list  in  a  fat-tree  can  be  determined 
efficiently  by  a  conservative  algorithm  on  the  fat-tree. 

As  a  final  comment,  we  note  that  the  notion  of  a  conservative  algorithm  may  well  be  too 
conservative.  As  a  practical  matter,  it  is  probably  not  worth  worrying  whether  every  set  of 
memory  accesses  is  conservative  with  respect  to  the  input,  as  long  as  the  load  factor  of  memory 
accesses is not much greater than the input load factor. For example, a contraction tree is not
conservative with respect to its input tree (though the levels of the contraction tree are), but
the load factor of the contraction tree is at most O(lg n) times the input load factor. Algorithms
with  this  looser  bound  are  somewhat  easier  to  code  because  of  the  relaxed  constraint,  and  they 
should perform comparably.

Chapter 3


Work-preserving  emulations 


3.1  Introduction 

In this chapter, we study the problem of emulating an $N_G$-node guest network $G = (V_G, E_G)$
on an $N_H$-node host network $H = (V_H, E_H)$ where $N_H \leq N_G$. Our goal is to emulate $T_G$ steps
of any computation on $G$ in $T_H = S T_G$ steps on $H$ where $S$ (the slowdown of the emulation) is
as small as possible.

The slowdown of the emulation must always be at least as large as $N_G/N_H$ since $G$ has
$N_G/N_H$ times as many processors as does $H$. If $S = O(N_G/N_H)$, then we say that the emulation
is work-preserving because then the total work (i.e., the processor-time product) performed by
the emulating network ($W_H = T_H N_H$) is within a constant factor of the work performed by the
guest network ($W_G = T_G N_G$). Such emulations achieve optimal speedup (to within a constant
factor) over sequential emulations of $G$ since they use $N_H$ processors to solve a problem $\Theta(N_H)$
times faster than is possible with a single processor.

More generally, we say that there is a work-preserving emulation of a class of networks $\mathcal{G}$
by a class of networks $\mathcal{H}$ with slowdown $S(N)$ if for every $N$ and $T$, we can emulate any $T$
steps of any $S(N)N$-node network in $\mathcal{G}$ in $O(S(N)T)$ steps on any $N$-node network in $\mathcal{H}$. If
$S(N) = O(\log^{\alpha} N)$ for some constant $\alpha$, then we say that the emulation is NC work-preserving
since every step of $G$ can be emulated in $O(\log^{\alpha} N)$ steps of $H$. If $S(N) = O(N^{\alpha})$ for some
constant $\alpha$, then we say that the emulation is polynomial-time work-preserving, and so on. In
the special case that $S(N) = O(1)$, we say that the emulation is real-time. Real-time emulations
are the hardest to obtain since we require the host network to emulate a guest network of the
same size with constant slowdown.

This chapter describes joint research with Richard Koch, Tom Leighton, Satish Rao, and Arnold Rosenberg
[40].

As a simple example, let $\mathcal{G}$ be the class of linear arrays, and $\mathcal{H}$ be the class of all
bounded-degree connected networks. It is well known [?] that an $N$-node linear array can be embedded
one-to-one in any connected bounded-degree $N$-node network with constant dilation and
congestion. (By an embedding of a graph $G$ into a graph $H$, we mean a mapping $\phi : G \to H$
that maps the nodes of $G$ to the nodes of $H$ and the edges of $G$ to paths in $H$. The dilation
of an embedding is the length of the longest path $\phi(e)$ corresponding to an edge $e$ of $G$. The
congestion of an embedding is the largest number of paths $\phi(e)$ crossing a single edge of $H$.
The load of an embedding is the maximum number of nodes of $G$ mapped to a single node of
$H$. In a one-to-one embedding, the load is 1.) Hence any $N$-node bounded-degree connected
network $H$ can emulate any $N$-node linear array with constant slowdown, and thus there is a
real-time emulation of the class $\mathcal{G}$ by the class $\mathcal{H}$.
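
The three measures just defined can be computed mechanically from an embedding. The following Python sketch is illustrative only: the function name `embedding_metrics` and the dictionary representation of the node map and path map are our assumptions, not notation from the text. The example data embeds a 4-node linear array into a 4-node star (center `c`, leaves `x`, `y`, `z`), a host chosen only because it is small enough to check by hand.

```python
from collections import Counter

def embedding_metrics(node_map, path_map):
    """Return (load, dilation, congestion) of an embedding of G into H.

    node_map: dict mapping each node of G to a node of H.
    path_map: dict mapping each edge (u, v) of G to a list of H-nodes
              forming a path from node_map[u] to node_map[v].
    """
    # Load: the largest number of guest nodes mapped to one host node.
    load = max(Counter(node_map.values()).values())

    dilation = 0
    edge_use = Counter()
    for (u, v), path in path_map.items():
        assert path[0] == node_map[u] and path[-1] == node_map[v]
        # Dilation: path length, counted in host edges.
        dilation = max(dilation, len(path) - 1)
        for a, b in zip(path, path[1:]):
            edge_use[frozenset((a, b))] += 1  # host edges are undirected

    # Congestion: the largest number of paths crossing one host edge.
    congestion = max(edge_use.values(), default=0)
    return load, dilation, congestion

# Linear array 0 - 1 - 2 - 3 embedded one-to-one into a star with center "c".
node_map = {0: "x", 1: "c", 2: "y", 3: "z"}
path_map = {
    (0, 1): ["x", "c"],
    (1, 2): ["c", "y"],
    (2, 3): ["y", "c", "z"],  # must detour through the center
}
metrics = embedding_metrics(node_map, path_map)
```

Here the load is 1 (the embedding is one-to-one), while the detour through the center gives dilation 2 and puts two paths on the host edge between `c` and `y`.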

As  another  simple  example,  consider  the  more  interesting  problem  of  emulating  a  butterfly 
on  a  linear  array.  We  will  prove  that  the  class  of  butterflies  cannot  be  real-time  emulated  by 
the  class  of  linear  arrays.  (This  should  come  as  no  surprise,  although  the  proof  is  not  entirely 
trivial.)  However,  there  is  a  simple  work-preserving  emulation  of  the  class  of  butterflies  by  the 
class of linear arrays with slowdown $2^N$. In particular, consider an $N 2^N$-node butterfly with
nodes and edges

$V = \{(i, w) \mid 1 \leq i \leq N,\ w \in \{0,1\}^N\}$, and

$E = \{((i, w), (i', w')) \mid i' = i + 1,\ w' = w \text{ or } w' = w^{(i)}\}$,

where $w^{(i)}$ denotes $w$ except that the $i$th bit is changed. Then by mapping the $2^N$ nodes of the
form $(i, w)$ (where $w \in \{0,1\}^N$) to the $i$th node of the linear array, an $N$-node linear array can
emulate an $N 2^N$-node butterfly with $2^N$ slowdown.
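
The mapping can be checked on a small instance. The Python sketch below (the helper names `butterfly` and `array_embedding_stats` are ours) builds the butterfly under the assumption that levels are numbered 1 through $N$ and that the edges from level $i$ to level $i + 1$ either keep $w$ or change its $i$th bit, and it maps every node $(i, w)$ to array node $i$. Under these assumptions each butterfly edge lies on the single array edge between nodes $i$ and $i + 1$, so the dilation is 1, and the load and congestion are both proportional to $2^N$.

```python
from collections import Counter
from itertools import product

def butterfly(n):
    """Return (nodes, edges) of the n * 2**n-node butterfly described in the text."""
    words = list(product((0, 1), repeat=n))
    nodes = [(i, w) for i in range(1, n + 1) for w in words]
    edges = []
    for i in range(1, n):  # edges join level i to level i + 1
        for w in words:
            flipped = w[:i - 1] + (1 - w[i - 1],) + w[i:]  # w with its ith bit changed
            edges.append(((i, w), (i + 1, w)))
            edges.append(((i, w), (i + 1, flipped)))
    return nodes, edges

def array_embedding_stats(n):
    """Load and congestion when node (i, w) is mapped to node i of an n-node linear array."""
    nodes, edges = butterfly(n)
    load = Counter(i for i, _ in nodes)
    # Edge ((i, w), (i + 1, w')) is routed over the array edge between nodes i and i + 1.
    congestion = Counter(i for (i, _), _ in edges)
    return max(load.values()), max(congestion.values(), default=0)

nodes, edges = butterfly(3)
load, congestion = array_embedding_stats(3)
```

For $N = 3$ the butterfly has $3 \cdot 2^3 = 24$ nodes, each array node holds $2^3 = 8$ of them, and each array edge carries $2 \cdot 2^3 = 16$ butterfly edges, which is consistent with a slowdown proportional to $2^N$.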

Seeing  this  elementary  example,  one  is  tempted  to  ask  if  there  are  faster  work-preserving 
emulations  of  a  butterfly  on  a  linear  array.  In  other  words,  can  we  emulate  a  smaller  butterfly 
(say  with  polynomial  blowup)  in  a  work-preserving  fashion  on  a  linear  array?  Although  the 
proof is not obvious, the answer is no. There is no polynomial-time work-preserving emulation
of the class of butterflies by the class of linear arrays. Any such emulation requires exponential
slowdown. Alternatively, we might wonder if a linear array can emulate any bounded-degree
network in a work-preserving fashion given enough slowdown. Again, the answer is no.
Although the linear array can emulate a butterfly in a work-preserving fashion, it cannot emulate
any expander, no matter how much blowup is allowed. In fact, by combining these results
we can conclude that even a butterfly is not sufficiently powerful to emulate an expander in a
work-preserving fashion.

We also consider emulations that are not work-preserving. Such emulations are (by definition)
inefficient, and we define the inefficiency of such an emulation to be $I = W_H/W_G$. In
these terms, an emulation is work-preserving if it has constant inefficiency. Many of our bounds
will reflect tradeoffs between slowdown and inefficiency. In general,

$I = \frac{T_H N_H}{T_G N_G} = S/C$,

where $C = N_G/N_H$ is the contraction of an emulation.
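
As a numerical check of these definitions, the following Python sketch (function and argument names are illustrative assumptions) computes the slowdown, contraction, and inefficiency from $N_G$, $T_G$, $N_H$, and $T_H$, taking $S = T_H/T_G$, $W_G = T_G N_G$, and $W_H = T_H N_H$ as defined earlier. The sample data is the butterfly-on-linear-array emulation with $N = 4$, under the simplifying assumption that the slowdown is exactly $2^N = 16$ with no hidden constant.

```python
from fractions import Fraction

def emulation_costs(n_guest, t_guest, n_host, t_host):
    """Return (slowdown S, contraction C, inefficiency I) as exact fractions."""
    slowdown = Fraction(t_host, t_guest)        # S = T_H / T_G
    contraction = Fraction(n_guest, n_host)     # C = N_G / N_H
    work_host = t_host * n_host                 # W_H = T_H * N_H
    work_guest = t_guest * n_guest              # W_G = T_G * N_G
    inefficiency = Fraction(work_host, work_guest)  # I = W_H / W_G
    return slowdown, contraction, inefficiency

# Guest: butterfly with N * 2**N = 4 * 16 = 64 nodes, run for 10 steps.
# Host: 4-node linear array, assumed to need 16 host steps per guest step.
S, C, I = emulation_costs(n_guest=64, t_guest=10, n_host=4, t_host=160)
```

Because the slowdown equals the contraction in this example, the inefficiency is 1, which is the work-preserving case.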

3.1.1  The  motivation 

There  are  several  good  reasons  for  studying  the  problem  of  emulating  one  network  on  another 
in a work-preserving fashion. First, this kind of analysis gives us an excellent means by which to
compare  the  computational  power  of  one  network  relative  to  that  of  another.  More  importantly, 
it  gives  us  an  automatic  way  to  compile  and  run  algorithms  designed  for  one  kind  of  parallel 
architecture  without  loss  of  efficiency  on  another.  This  is  provided,  of  course,  that  the  ratio 
of  the  size  of  the  problem  to  the  size  of  the  machine  is  large  enough.  For  example,  we  have 
already  seen  that  a  small  linear  array  (which  has  a  very  simple  structure)  is  just  as  efficient  in 
terms of work as a very large butterfly (which has a more complicated structure).

More  generally,  the  study  of  work-preserving  emulations  lies  at  the  heart  of  efficient  parallel 
computing. Indeed, one of the central problems in efficient parallel computing is the task of
mapping  a  collection  of  processes  linked  by  precedence  and/or  communication  constraints  onto 
the  processors  and  routing  network  of  a  parallel  machine  so  that 

1.  the  processing  load  imposed  on  the  processors  is  balanced, 

2. the communication between processors can be handled efficiently, and

3. the computation and communication can be scheduled so that the necessary inputs for a
process are available where and when the process is scheduled to be computed.

In  other  words,  we  would  like  to  schedule  the  communication  and  computation  in  a  way  that 
takes  maximum  advantage  of  the  available  hardware  to  minimize  the  completion  time  of  the 
job. 

In  general,  we  can  model  the  computation  to  be  performed  by  a  DAG.  Each  node  of  the 
DAG represents a process and each directed edge $(u, v)$ represents a communication that must
take place between $u$ and $v$. Typically, this communication represents data output from $u$
after $u$ is completed which is to be input to $v$ before $v$ is started. The parallel machine can
be modeled as an undirected network. The nodes of the network correspond to processors,
and the edges correspond to communication links between processors (and/or their associated
memories). The implementation of the computation to be performed on the parallel machine
then corresponds to an embedding of the DAG in the network so that nodes of the DAG are
mapped to nodes of the network and so that edges of the DAG are mapped to paths in the
network. We may also need to construct a schedule that specifies the communication and
computation  of  the  DAG  that  is  being  performed  during  each  step  of  the  network.  This  will  be 
particularly  important  if  the  parallel  machine  is  synchronous. 

In  many  applications,  the  DAG  possesses  a  very  natural  structure.  For  example,  typical 
DAGs encountered in practice are derivatives of a binary tree, array, butterfly, or shuffle-exchange
graph. This is often due to the fact that the DAG is associated with an algorithm
whose  inherent  underlying  structure  is  a  tree  or  array  (as  is  the  case  for  many  problems  in 
numerical  analysis  and  linear  algebra)  or  a  butterfly  or  shuffle-exchange  graph  (as  is  the  case 
for  Fourier  Transform  and  data  manipulation  problems).  Alternatively,  it  could  be  that  the 
DAG  was  constructed  from  an  algorithm  specifically  designed  for  use  on  one  of  these  common 
parallel  architectures. 

Similarly,  parallel  networks  also  tend  to  be  very  naturally  structured  and  typically  are 
configured  as  trees,  arrays,  butterflies,  and  the  like.  Hence,  the  mapping  problem  often  consists 
of emulating $T_G$ steps of one $N_G$-node network (represented as a $T_G N_G$-node DAG) on an
$N_H$-node network with a different structure. Ideally, we would like to perform the computation in
$O(T_G N_G/N_H)$ steps, which is precisely the problem of finding a work-preserving emulation of
one network on another.

In  practice,  the  guest  network  can  be  substantially  larger  than  the  host  network.  For 
example, it is not uncommon for a parallel machine with between 8 and 256 processors to
be  emulating  array-based  computations  involving  hundreds  of  thousands  of  data  points.  In 
such  examples,  even  work-preserving  emulations  with  exponential  slowdown  may  be  within  the 
scope  of  practicality.  Indeed,  the  most  important  feature  of  the  computation  is  that  it  be 
work-preserving. 

3.1.2  A  closer  look  at  the  computational  model 

If we can find an embedding of a graph $G$ into a graph $H$ with constant dilation, congestion, and
load, then it is fairly clear that $H$ can emulate $G$ with constant slowdown. Is the reverse true?
Somewhat surprisingly, it is not. For example, Bhatt, Chung, Hong, Leighton and Rosenberg
[11] proved that any embedding of an $N$-node mesh into an $N$-node butterfly with constant load
requires dilation $\Omega(\log N)$, the worst possible. At first glance, it might seem that this result
implies that any emulation of an $N$-node mesh by an $N$-node butterfly must have slowdown
at least $\Omega(\log N)$. However, in this chapter we show that an $N$-node butterfly can emulate
$T$ steps of an $N$-node mesh in $O(T \log\log N)$ steps. In [40] we present a more sophisticated
emulation scheme that requires only $O(T)$ steps.

In order to understand how such a contradictory result is possible, we need to take a closer
look at what it means to emulate $T_G$ steps of one network in $T_H$ steps on another. We start
by modeling the computation performed by the guest network $G$ as a pebble DAG $\Gamma$. In
particular, we will have a pebble for every node-time pair $(v, t)$ where $v$ is a node of $G$ and
$0 \leq t \leq T_G$. (Pairs of the form $(v, 0)$ correspond to inputs.) In fact, we may have many pebbles
associated with a single pair $(v, t)$, which will correspond to the same computation being done
more than once. (This is the trick that allows us to emulate a mesh on a butterfly in real
time.) To compute any pebble labeled $(v, t)$, we need as inputs pebbles labeled $(v, t - 1)$ and
$(v_1, t - 1), (v_2, t - 1), \ldots, (v_k, t - 1)$, where $v_1, v_2, \ldots, v_k$ are the neighbors of $v$ in $G$. We use the
directed edges of $\Gamma$ to denote this dependence in the usual way.
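
The dependence rule can be written down directly. The Python sketch below is a minimal illustration that assumes exactly one pebble per label (the text allows several pebbles to share a label); the function names `pebble_predecessors` and `pebble_dag` and the adjacency-list representation are our assumptions.

```python
def pebble_predecessors(adj, v, t):
    """Labels needed as inputs to create the pebble (v, t), for t >= 1."""
    # The pebble for v itself at the previous step, then one per neighbor of v.
    return [(v, t - 1)] + [(u, t - 1) for u in adj[v]]

def pebble_dag(adj, t_guest):
    """Map every non-input label (v, t), 1 <= t <= t_guest, to its predecessor labels.

    Labels of the form (v, 0) are inputs, so they appear only as predecessors.
    """
    return {
        (v, t): pebble_predecessors(adj, v, t)
        for t in range(1, t_guest + 1)
        for v in adj
    }

# Guest: the 3-node linear array 0 - 1 - 2, emulated for 2 steps.
adj = {0: [1], 1: [0, 2], 2: [1]}
dag = pebble_dag(adj, 2)
```

In this small case the pebble labeled `(1, 2)` depends on `(1, 1)`, `(0, 1)`, and `(2, 1)`, and the DAG has one non-input pebble per node per step.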

Because many pebbles can have the same label, there are many DAGs $\Gamma$ associated with any
graph $G$. In order to emulate $G$ on $H$, we only need to find an embedding and an accompanying
schedule of one of these DAGs in $H$. Once an embedding and schedule of a DAG is fixed, the
emulation proceeds in a standard way. In particular, during each step of the computation, a
node of $H$ can

1.  make  a  copy  of  a  single  pebble  that  it  contains, 

2.  send  a  single  pebble  to  a  neighbor,  and/or 

3. create a pebble with label $(v, t)$ provided that it already contains input pebbles with labels
$(v, t - 1)$ and $(v_1, t - 1), \ldots, (v_k, t - 1)$.

Initially, we will allow a node of $H$ to have access to any input, although to use any of
these inputs in a meaningful way will take time. By the end of the emulation, we must have
computed pebbles with all labels of the form $(v, T_G)$. (For purposes of simplicity, we will use a
pebble to denote the state of a processor of $G$ at some particular time, as described above. A
more  general  interpretation  would  be  to  use  a  pebble  to  denote  one  of  many  items  (e.g.,  data 
and/or  functions)  stored  within  a  processor.  All  of  our  results  hold  under  the  more  general 
interpretation,  although  some  of  the  emulation  results  become  more  complicated.) 

By allowing several pebbles to have the same label, we dramatically increase the number
of possible computation DAGs $\Gamma$ that correspond to a $T_G$-step computation of $G$. This makes
it more likely that we can find a computation that can be efficiently emulated on some host
network $H$ (e.g., as is the case with emulating a mesh on a butterfly), but it also makes the task
of proving lower bounds much more difficult. For example, in order to prove that $H$ cannot
emulate $G$ in real-time, we must show that for some $T_G$, there is no DAG $\Gamma$ associated with a
$T_G$-step computation of $G$ that can be emulated in $O(T_G)$ steps on $H$. This can be a formidable
task since $\Gamma$ can look very different than $G$. Indeed, at the very least, we must choose $T_G$ to
be large since by allowing redundant computations of pebbles, any $O(1)$ steps of any $N$-node
bounded-degree graph $G$ can be computed in $O(1)$ steps on any $N$-node graph $H$. (This is
because if $T_G = O(1)$, then any output pebble can only depend on $O(1)$ input pebbles, which
can be redundantly computed locally since every node of $H$ is assumed to have access to all
input pebbles.)
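
The parenthetical argument can be made concrete by counting. The Python sketch below (helper names `input_cone`, `local_recompute_cost`, and `ring` are ours) computes, for one output pebble $(v, t)$, the set of guest nodes whose inputs it depends on and the number of pebbles a single host node would create to rebuild $(v, t)$ entirely on its own. The ring is used as the guest only because it is an easy bounded-degree example; it is not taken from the text.

```python
def input_cone(adj, v, t):
    """Guest nodes u whose input pebbles (u, 0) can influence the pebble (v, t)."""
    cone = {v}
    frontier = {v}
    for _ in range(t):
        # Grow the cone by one hop per guest time step.
        frontier = {w for u in frontier for w in adj[u]} - cone
        cone |= frontier
    return cone

def local_recompute_cost(adj, v, t):
    """Number of pebbles (u, s), 1 <= s <= t, created to build (v, t) on one host node."""
    # Pebble (u, s) is needed exactly when u is within distance t - s of v.
    return sum(len(input_cone(adj, v, t - s)) for s in range(1, t + 1))

def ring(n):
    """Adjacency lists of an n-node ring, a degree-2 guest graph."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

small = (len(input_cone(ring(8), 0, 2)), local_recompute_cost(ring(8), 0, 2))
large = (len(input_cone(ring(1000), 0, 2)), local_recompute_cost(ring(1000), 0, 2))
```

Both counts are the same for the 8-node and the 1000-node ring: they depend on $t$ and the degree but not on $N$, which is why a constant number of guest steps gives no leverage for a lower bound.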

Note that when we prove a lower bound on the ability of a graph $H$ to emulate a graph
$G$, it does not necessarily mean that $H$ cannot effectively compute the same result as does $G$
(possibly by using a different algorithm, for example). Rather, we are proving lower bounds
on the ability of $H$ to perform the same step-by-step computations as $G$ when $G$ is used in
a general purpose way. Hence the term emulation. We suspect that our pebbling model is
probably the most general model in which we could hope to prove lower bounds.

Throughout the chapter we will make use of the fact that if there is an embedding of $G$
in $H$ with congestion $c$, dilation $d$, and load $l$, then there is an emulation of $G$ by $H$ with
slowdown $O(l + c + d)$. This follows from the proof in Section 1.2 that for any set of packets
whose paths have congestion $c$ and dilation $d$, there is a schedule of length $O(c + d)$ in which
at most one packet traverses each edge at each step. When $H$ is an array, tree, butterfly, or
shuffle-exchange graph, the schedule can be computed on-line using the algorithm for layered
networks from Section 1.3.

3.1.3  Our  results 

The  technical  portion  of  this  chapter  is  divided  into  five  sections.  We  commence  in  Section  3.2 
with  some  general  techniques  for  establishing  the  existence  or  nonexistence  of  a  work-preserving 
emulation.  In  particular,  we  describe  two  general  methods  for  proving  lower  bounds  on  the 
slowdown  of  a  work-preserving  emulation.  The  first  method  is  based  on  dilation  considerations 
and  appears  in  Section  3.2.1.  As  an  application  of  this  method,  we  prove  that  any  class  of  low 
diameter networks (such as complete binary trees) cannot be emulated in real time on any class
of  networks  that  has  poor  expansion  properties  (such  as  arrays  of  bounded  dimension). 

The  second  method  is  based  on  congestion  properties  and  is  presented  in  Section  3.2.2. 
Here  we  describe  a  general  method  for  proving  that  a  work-preserving  emulation  requires  a 
large  amount  of  time,  or  that  it  is  impossible  altogether.  As  an  example,  we  prove  that  any 
work-preserving  emulation  of  a  butterfly  on  an  array  of  bounded-dimension  requires  exponential 
time,  and  that  it  is  not  possible  to  emulate  an  expander  on  a  butterfly  in  work-preserving 
fashion.  These  results  provide  a  curious  contrast  between  the  power  of  a  linear  array,  butterfly, 
and  an  expander.  By  most  standards,  it  would  seem  that  a  butterfly  is  closer  in  power  to  an 
expander  than  it  is  to  a  linear  array.  Yet  a  linear  array  can  emulate  a  butterfly  in  a  work- 
preserving  fashion,  but  a  butterfly  (or  most  any  non-expander)  cannot  emulate  an  expander  in 
a  work-preserving  fashion. 


In Sections 3.3 through 3.6, we focus on the special case of emulations by arrays, complete
binary trees, butterflies, and shuffle-exchange graphs, respectively. In Section 3.3, we prove
tight bounds on the slowdown required for an array to emulate a tree, array or butterfly. In
Section 3.4, we prove that there is a work-preserving emulation of bounded-degree trees by
complete binary trees with $O(\log\log N)$ slowdown. We also give evidence, but no proof, that
there is no corresponding real-time emulation for this class. (Proving that a complete binary
tree cannot emulate a complete ternary tree in real-time is one of several challenging questions
left open in this chapter.)

In Section 3.5, we show that an $N$-node butterfly can emulate an $N$-node mesh with slowdown
$O(\log\log N)$. In [40] we show that the emulation can be performed in real-time. This
result is interesting because any one-to-one embedding of an array (with dimension 2 or more)
in a butterfly requires $\Omega(\log N)$ dilation [11], which suggests that any emulation must require
slowdown $\Omega(\log N)$. The result takes on added significance given the fact that many parallel
numerical algorithms are array-based while several parallel machines are butterfly-based.

We also describe a simple constant-congestion embedding of an $N$-node shuffle-exchange
graph in an $N$-node butterfly in Section 3.5. This result has several important consequences.
First, it can be used to provide an elementary proof that the $N$-node shuffle-exchange graph can
be laid out in $O(N^2/\log^2 N)$ area and in $O(N^{3/2}/\log^{3/2} N)$ volume. Both results are optimal.
The area bound was known previously [38], but the proof was much more difficult (as were
the proofs for several suboptimal layouts for the shuffle-exchange graph [34, 48, 50, 90]). The
3-d layout bound is new and was not obtainable by any of the previous approaches to the 2-d
layout problem. Second, we apply the result to derive an $O(\log N)$-slowdown work-preserving
emulation of the shuffle-exchange graph on the butterfly.

In Section 3.6, we prove the reverse, namely, that there is an $O(\log N)$-slowdown work-preserving
emulation of the butterfly on the shuffle-exchange graph. Taken together, these
results come very close to resolving a long open question concerning whether or not the butterfly
and shuffle-exchange graph are computationally equivalent. In particular, we show that up to
NC  emulations,  the  butterfly  and  shuffle-exchange  graphs  are  equivalent  in  a  work-preserving 
sense.  Thus,  for  many  problems,  they  can  be  considered  to  be  computationally  equivalent. 

As a consequence of the emulations in Section 3.6, we also obtain a real-time emulation
of bounded-degree arrays in the shuffle-exchange graph, and we show how to sort $N$ numbers
with high probability in $O(\log N)$ steps on an $N$-node shuffle-exchange graph. Although the
proof of the sorting bound is elementary, it resolves an open question concerning the difficulty
of randomized sorting algorithms on the shuffle-exchange graph. Previously, such an algorithm
was known for the butterfly [53, 76, 84] but that algorithm made crucial use of the recursive
structure of the butterfly, a structure not present in a shuffle-exchange graph.

3.1.4  Previous  work 

There  has  been  a  great  deal  of  previous  work  on  graph  embeddings  with  the  intent  of  showing 
that one network can or can’t emulate another network efficiently [11, 12, 13, 28, 53, 80]. Many
of the results were positive and proved things like “all $N$-node binary trees can be emulated in
constant time on an $N$-node hypercube.” There were also some negative results, but because
of the lack of a good model, their significance is now less clear. For example, even though an
embedding of a mesh into a butterfly requires dilation $\Omega(\log N)$, we now find that a butterfly
can  emulate  a  mesh  with  constant  slowdown. 

The  notion  of  work-preserving  emulations  in  PRAM  models  has  previously  been  studied 
[42, 67] and served to motivate this work. Related problems of scheduling computations on
fixed-connection networks have also been studied [72].


3.2  Lower  bounds 

In  this  section  we  present  lower  bounds  on  slowdown  and  inefficiency.  Loosely  speaking,  these 
lower  bounds  apply  when  the  guest  graph  expands  faster  than  the  host  graph.  The  first  lower 
bound  can  be  used  to  show  that  any  emulation  of  a  complete  binary  tree  by  a  linear  array  has 
slowdown $\Omega(N_H/\log N_H)$. The second can be used to show that a butterfly cannot perform
a  work-preserving  emulation  of  an  expander  graph,  that  any  work-preserving  emulation  of  a 
butterfly by a linear array $H$ requires slowdown at least $2^{\Omega(N_H)}$, and that any work-preserving
emulation of a $(k+1)$-dimensional mesh by a $k$-dimensional mesh $H$ requires slowdown at least
$\Omega(N_H^{1/k})$. All of these lower bounds on slowdown are tight.

Before proving the lower bounds, we need to introduce some notation. For an undirected
graph $G = (V, E)$, let $\delta(u, v)$ be the length (number of edges) of the shortest path between
nodes $u$ and $v$ in $G$. Let $B_G(u, i) = \{v \in V \mid \delta(u, v) \leq i\}$ be the set of nodes within a distance $i$
of $u$ in $G$ and let $b_G(u, i) = |B_G(u, i)|$. We call $b_G$ the growth function of $G$.

3.2.1  Distance-based  lower  bound 

The  following  theorem  shows  that  if  the  guest  graph  grows  faster  than  the  host  graph,  then 
any  emulation  of  the  guest  by  the  host  must  be  slow. 

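Theorem 52 Let $H = (V_H, E_H)$ be an $N_H$-node host graph and $G = (V_G, E_G)$ be an $N_G$-node
guest graph, and suppose that there are integers $r_H$ and $r_G$ such that

\[ \max_{u \in V_H} \sum_{i=1}^{r_H} b_H(u,i) < \min_{v \in V_G} \sum_{j=1}^{r_G} b_G(v,j). \]

Then any emulation of $T_G \ge r_G$ steps of $G$ by $H$ has slowdown

\[ S \ge \frac{r_H + 1}{2 r_G}. \]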


Proof: The basic idea is to find a sequence of $T_G/r_G$ pebbles in any $T_G$-step pebble DAG of $G$
such that each pair of consecutive pebbles is separated by at most $r_G$ guest time steps but is
created in $H$ at least $r_H$ host time steps apart. As we shall see, such a sequence exists only if
the slowdown $S = T_H/T_G$ is at least $(r_H + 1)/2r_G$.

We start the sequence with the last pebble created by $H$. Suppose that at time $T_H$ some
node $u_0 \in V_H$ creates a pebble for DAG node $(v_0, t_0)$, where $t_0 = T_G$. The pebble for $(v_0, t_0)$
cannot be created by $H$ until pebbles for all of its predecessors in the DAG are created. In
particular, there are at least $\sum_{j=1}^{r_G} b_G(v_0, j)$ predecessors for time steps $t_0 - r_G$ through $t_0 - 1$. We
want to show that the pebble for at least one of these predecessors must have been created by
the host graph before time $T_H - r_H$. The pebble for every predecessor of $(v_0, t_0)$ that is created
at distance $i$ from $u_0$ in $H$ must be created at or before time $T_H - i$. Thus at most $\sum_{i=1}^{r_H} b_H(u_0, i)$
pebbles for predecessors of $(v_0, t_0)$ are created by $H$ between time steps $T_H - r_H$ and $T_H - 1$.
Since $\max_{u \in V_H} \sum_{i=1}^{r_H} b_H(u,i) < \min_{v \in V_G} \sum_{j=1}^{r_G} b_G(v,j)$, the pebble for some predecessor
$(v_1, t_1)$, $t_1 \ge T_G - r_G$, must be created by the host graph at or before time $T_H - (r_H + 1)$.

We can repeat the argument to find a pebble for a predecessor $(v_2, t_2)$, $t_2 \ge T_G - 2r_G$, of
$(v_1, t_1)$ that must be created by the host at or before time $T_H - 2(r_H + 1)$, and so on. Eventually
we obtain a pebble $(v_k, t_k)$ such that $r_G \ge t_k \ge T_G - k r_G$. This pebble must be created by the
host at or before time $T_H - k(r_H + 1)$. We assume that input pebbles are created at host time
step 0, and that the emulation begins with time step 1. Thus, $T_H - k(r_H + 1) \ge 0$. Combining
these inequalities, we have

\[ T_H/T_G \ge \frac{r_H + 1}{2 r_G} \]

for $T_G \ge r_G$. □

Corollary 53 Any such emulation has inefficiency $I \ge (r_H + 1) N_H / (2 r_G N_G)$.

Corollary 54 Any emulation of a complete binary tree, $G$, by a $k$-dimensional mesh, $H$, has
slowdown at least $\Omega((N_G/\log^k N_G)^{1/(k+1)})$.

Proof: Apply Theorem 52 with $r_G = \Theta(\log N_G)$ and $r_H = \Theta((N_G \log N_G)^{1/(k+1)})$. □
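The arithmetic behind this choice of parameters can be spelled out as follows. This is a sketch under two assumptions that the proof leaves implicit: that $b_H(u,i) = O(i^k)$ in a $k$-dimensional mesh, and that $r_G$ is a sufficiently large constant multiple of $\log N_G$, so that every ball of radius about $r_G/2$ already covers the whole tree.

```latex
\begin{align*}
\max_{u}\sum_{i=1}^{r_H} b_H(u,i) &= O\bigl(r_H^{\,k+1}\bigr),
  \qquad
\min_{v}\sum_{j=1}^{r_G} b_G(v,j) = \Omega(N_G \log N_G),\\
r_H &= \Theta\bigl((N_G \log N_G)^{1/(k+1)}\bigr)
  \quad\text{satisfies the hypothesis for a small enough constant},\\
S \;\ge\; \frac{r_H+1}{2r_G}
  &= \Omega\!\left(\frac{(N_G\log N_G)^{1/(k+1)}}{\log N_G}\right)
   = \Omega\!\left(\Bigl(\frac{N_G}{\log^{k} N_G}\Bigr)^{1/(k+1)}\right).
\end{align*}
```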

3.2.2  Congestion-based  lower  bound 

The second lower bound requires a little more notation. Let $G = (V,E)$ be an undirected graph
as before. For a set $U \subseteq V$, we define the $i$-neighborhood of $U$ to be the set of nodes within a
distance $i$ of some node in $U$, $N_i(U) = \bigcup_{u \in U} B_G(u,i) - U$. We define an $(R, f(R))$-decomposition
of $G$ to be a partition of $V$ into $|V|/R$ sets of nodes (regions) such that each contains $R$ nodes
and has a 1-neighborhood of size at most $f(R)$.

The last graph parameter that we need, $z_G$, is best described in terms of a simple game.
The player starts by choosing $a$ nodes of a connected graph $G$ and placing them in a bag. The
player is given a collection of $\varepsilon a$, $0 \le \varepsilon < 1$, tokens to play with. The game is played in rounds,
each consisting of two steps. In the first step, all of the neighbors of the nodes in the bag are
added to the bag. In the second step, the player may exchange tokens for nodes in the bag on
a one-for-one basis. Let $X_i$ be the set of nodes in the bag at the end of round $i$, and let $Y_i$ be
the set of nodes removed in the second step of round $i$. Then $X_i$ is given by the recurrence
$X_i = X_{i-1} + N_1(X_{i-1}) - Y_i$. The game ends when the number of nodes in the bag exceeds
its capacity, $c$, at the end of a step, where $c < N_G$. If $k$ is the number of rounds played, then
$|X_i| \le c$ for $i < k$, $|X_i| > c$ for $i = k$, and $\sum_i |Y_i| \le \varepsilon a$. The goal is to play as many rounds
as possible. Let $z_G(a, \varepsilon, c)$ be an upper bound that is non-increasing in $a$ on the length of the
longest possible game.
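To make the game concrete, the special case with no tokens ($\varepsilon = 0$) can be simulated directly, since the player then has no choices after picking the starting nodes. The sketch below is illustrative only; the function names and the adjacency-dictionary representation are assumptions, and it does not attempt the general case in which tokens are spent.

```python
def rounds_without_tokens(adj, start, capacity):
    """Play the bag game with no tokens: each round adds every neighbor of
    the bag. Returns the number of rounds after which the bag first holds
    more than `capacity` nodes."""
    bag = set(start)
    rounds = 0
    while len(bag) <= capacity:
        grown = bag | {y for x in bag for y in adj[x]}
        if grown == bag:
            # The bag covers its whole component and can never exceed capacity.
            raise ValueError("capacity must be below the component size")
        bag = grown
        rounds += 1
    return rounds

def path_graph(n):
    """Adjacency dictionary of an n-node path (hypothetical helper)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
```

On a 16-node path with capacity 8, starting from an endpoint lasts 8 rounds while starting from the middle lasts only 4, so a player who wants a long game starts where the graph expands slowly.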


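Theorem 55 Suppose that $H = (V_H, E_H)$ is an $N_H$-node host graph with an $(R, f(R))$-decomposition,
and that $G = (V_G, E_G)$ is an $N_G$-node guest graph. Let

\[ \beta = \max\left\{ z_G\!\left(\frac{N_G}{4}, 0, \frac{3N_G}{4}\right),\; z_G\!\left(\frac{3N_G R}{8N_H}, \frac{1}{2}, \frac{N_G}{2}\right) \right\}. \]

Then for any emulation of $G$ by $H$ where $T_G \ge 3\beta$,

\[ I \ge \min\left\{ \frac{R}{32\beta f(R)},\; \frac{N_H}{96R} \right\}. \]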


Proof: The basic strategy is to show that either the host spends a lot of time passing pebbles
across the perimeters of the regions in the $(R, f(R))$-decomposition, or the host spends a lot
of time creating pebbles. We will break the $T_G$ guest time steps into blocks of $3\beta$ consecutive
steps and classify every block as either an importer or a creator. If a block is an importer, then
many pebbles for the block cross region perimeters. If a block is a creator, then some region
creates many pebbles for the block. If the majority of the blocks are importers, then the time
required by the host to pass pebbles across the perimeters of the regions is large. Otherwise, the
time required to create the pebbles is large.

Before we can get started we need one more piece of notation. For each node $v$ in $G$ there is
at least one pebble created by $H$ for each guest time step $t$ between 1 and $T_G$. The first pebble
created for $v$ for time $t$ is called the $t$-primary pebble for $v$. For each value of $t$ there are exactly
$N_G$ $t$-primary pebbles. The $t$-primary pebbles are ordered according to the order in which they
are created by $H$, with ties broken arbitrarily. We call the first $3N_G/4$ $t$-primary pebbles the
$t$-early pebbles and the last $3N_G/4$ the $t$-late pebbles.

We begin with the definition of an importer block. Consider a block from step $t$ to $t - 3\beta + 1$.
The average number of $t$-early pebbles created by each of the $N_H/R$ regions in the decomposition
of $H$ is at least $p = 3N_G R/4N_H$. We say that a region is $t$-busy if it creates at least $p/2$ $t$-early
pebbles. We say that a $t$-early pebble is $t$-busy if it is created by a $t$-busy region. At least
half of the $t$-early pebbles are $t$-busy. Thus, there are at least $3N_G/8$ $t$-busy pebbles. Suppose
that a $t$-busy region creates $s \ge p/2$ $t$-busy pebbles. We say that the region is an importer if
it imports at least $s/2$ pebbles for time steps between $t - 1$ and $t - 2\beta$. We say that a block
is an importer if every $t$-busy region is an importer, or if some region imports at least $3N_G/16$
pebbles for time steps between $t - 1$ and $t - 2\beta$. In an importer block, a total of at least $3N_G/16$
pebbles for time steps between $t - 1$ and $t - 2\beta$ are imported by all of the regions.

If at least half of the $T_G/3\beta$ blocks are importers, then we can find a lower bound on
inefficiency by computing the time required to import pebbles. In this case, the total number
of pebbles imported by all of the importer blocks is at least $T_G N_G/32\beta$. The host time required
to import these pebbles is at least $T_H \ge T_G N_G R/(32\beta N_H f(R))$, because at each host time step,
each of the $N_H/R$ regions can import at most $f(R)$ pebbles. In this case,

\[ I \ge \frac{R}{32\beta f(R)}. \]

As we shall see, if a block is not an importer then some region must create many pebbles for
the block. Hence the name creator. In a creator block there must be some $t$-busy region $\mathcal{R}$ that
creates $s \ge p/2$ $t$-busy pebbles but imports fewer than $s/2$ pebbles for time steps between $t - 1$
and $t - 2\beta$. The $t$-busy pebbles created by $\mathcal{R}$ cannot be created until pebbles for all of their
predecessors in the pebble DAG are created. Since $z_G(s, 1/2, N_G/2) \le z_G(p/2, 1/2, N_G/2) \le \beta$,
$\mathcal{R}$ imports at most $s/2$ pebbles for time steps between $t - 1$ and $t - z_G(s, 1/2, N_G/2)$. Thus
$\mathcal{R}$ must create at least $N_G/2$ pebbles for time step $t - z_G(s, 1/2, N_G/2)$. Furthermore, since $\mathcal{R}$
imports at most $3N_G/16$ pebbles for time steps between $t - 1$ and $t - 2\beta$, it must create at least
$5N_G/16$ pebbles for every time step between $t - z_G(s, 1/2, N_G/2)$ and $t - 2\beta$. For each of these
time steps, at least $N_G/16$ of the pebbles are created for nodes whose $(t - 2\beta)$-primary pebbles
are $(t - 2\beta)$-late pebbles. We call these pebbles the descendant pebbles.

We have chosen the descendant pebbles so that none are created by $H$ until all of the
descendant pebbles for previous blocks have been created. The early pebbles for all time steps
at or before $t - 2\beta - z_G(N_G/4, 0, 3N_G/4)$ must be created before the $(t - 2\beta)$-late pebbles because
$3N_G/4$ nodes in $G$ lie within a distance $z_G(N_G/4, 0, 3N_G/4)$ of the nodes corresponding to the
first $N_G/4$ $(t - 2\beta)$-primary pebbles. Since $z_G(N_G/4, 0, 3N_G/4) \le \beta$, the early pebbles for
previous blocks must be created before the $(t - 2\beta)$-late pebbles. Furthermore, the $(t - 2\beta)$-late
pebbles must be created before the descendant pebbles, which in turn must be created before
the $t$-busy pebbles for $\mathcal{R}$.

If at least half of the blocks are creators, then we can derive a lower bound on inefficiency
by summing the time to create the descendant pebbles for each of the creator blocks. For each
of $T_G/6\beta$ creator blocks, at least $\beta N_G/16$ descendant pebbles are created by a single region.
The host time for each block is at least $\beta N_G/16R$. The host time for all of the creator blocks
is at least $T_G N_G/96R$ and the inefficiency is at least

\[ I \ge \frac{N_H}{96R}. \]

Combining the two cases proves the theorem. □
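The constants in the two cases can be checked by a short tally. This is a sketch that assumes blocks of $3\beta$ guest steps (so at least $T_G/6\beta$ blocks of whichever kind forms the majority) and takes the inefficiency to be the ratio of host work to guest work, $I = T_H N_H / (T_G N_G)$.

```latex
\begin{align*}
\text{importers:}\quad
  &\frac{T_G}{6\beta}\cdot\frac{3N_G}{16} = \frac{T_G N_G}{32\beta}
   \ \text{pebbles imported},\\
  &T_H \;\ge\; \frac{T_G N_G}{32\beta}\cdot\frac{R}{N_H f(R)}
   \;\Longrightarrow\;
   I = \frac{T_H N_H}{T_G N_G} \;\ge\; \frac{R}{32\beta f(R)};\\[4pt]
\text{creators:}\quad
  &T_H \;\ge\; \frac{T_G}{6\beta}\cdot\frac{\beta N_G}{16R} = \frac{T_G N_G}{96R}
   \;\Longrightarrow\;
   I \;\ge\; \frac{N_H}{96R}.
\end{align*}
```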

Corollary 56 A $k$-dimensional mesh $H$ cannot perform a work-preserving emulation of an
expander graph $G$.

Proof: Apply Theorem 55 with $R = \Theta((N_H \log N_H)^{k/(k+1)})$, $f(R) = \Theta(R^{(k-1)/k})$, and
$\beta = \Theta(\log(N_H/R))$. The inefficiency is at least $I \ge \Omega((N_H/\log^k N_H)^{1/(k+1)})$. □

Corollary 57 A butterfly network $H$ cannot perform a work-preserving emulation of an
expander graph $G$.

Proof: Apply Theorem 55 with $R = \Theta(N_H \log\log N_H/\log N_H)$, $f(R) = \Theta(R/\log R)$, and
$\beta = \Theta(\log(N_H/R))$. The inefficiency is at least $I \ge \Omega(\log N_H/\log\log N_H)$. □

Corollary 58 Any work-preserving emulation of a butterfly $G$ by a $k$-dimensional mesh $H$ has
slowdown at least $2^{\Omega(N_H^{1/k})}$.

Proof: Apply Theorem 55 with $R = \Theta((N_H \log N_G)^{k/(k+1)})$, $f(R) = \Theta(R^{(k-1)/k})$, and
$\beta = \Theta(\log N_G)$. The inefficiency is at least $I \ge \Omega((N_H/\log^k N_G)^{1/(k+1)})$. □
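The step from the inefficiency bound to the slowdown bound is left implicit. One way to fill it in, assuming that work-preserving means $I = O(1)$, that the inefficiency bound reads $I \ge \Omega((N_H/\log^k N_G)^{1/(k+1)})$, and that the slowdown is at least $N_G/N_H$ because some host node must emulate that many guest nodes:

```latex
\begin{align*}
I = O(1)
  &\;\Longrightarrow\; \Bigl(\frac{N_H}{\log^{k} N_G}\Bigr)^{1/(k+1)} = O(1)
   \;\Longrightarrow\; \log N_G = \Omega\bigl(N_H^{1/k}\bigr)
   \;\Longrightarrow\; N_G = 2^{\Omega(N_H^{1/k})},\\
S &\;\ge\; \frac{N_G}{N_H} \;=\; \frac{2^{\Omega(N_H^{1/k})}}{N_H}
   \;=\; 2^{\Omega(N_H^{1/k})}.
\end{align*}
```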

Corollary 59 Any work-preserving emulation of a $j$-dimensional mesh $G$ by a $k$-dimensional
mesh $H$, $j > k$, has slowdown at least $\Omega(N_H^{(j-k)/k})$.


Proof: Apply Theorem 55 with $R = \Theta((N_G^{1/j} N_H)^{k/(k+1)})$, $f(R) = \Theta(R^{(k-1)/k})$, and
$\beta = \Theta(N_G^{1/j})$. The inefficiency is at least $I \ge \Omega((N_H/N_G^{k/j})^{1/(k+1)})$. □


3.3 Emulations by arrays

Although the arrays cannot perform real-time emulations of graphs with small diameter, we
can  show  that  they  can  perform  work-preserving  emulations  of  complete  binary  trees,  other 
arrays,  and  butterflies.  In  each  case,  we  find  an  embedding  of  the  guest  graph  into  the  array 
with  acceptable  load,  congestion,  and  dilation.  The  edges  of  the  guest  graph  are  emulated  by 
routing  packets  between  the  nodes  of  the  linear  array.  All  of  the  following  results  can  be  shown 
to  be  tight  by  Corollaries  54,  58,  and  59. 

Observation 60 An $N$-node $k$-dimensional mesh can perform a work-preserving emulation of
an $N^{(k+1)/k}/\log N$-node complete binary tree.

Proof: An $N^{(k+1)/k}/\log N$-node complete binary tree can be embedded in an $N$-node
$k$-dimensional mesh with load $O(N^{1/k}/\log N)$, dilation $O(N^{1/k}/\log N)$, and congestion
(?(*  «/(*+»).  □ 

Observation 61 An $N$-node $k$-dimensional mesh can perform a work-preserving emulation of
an $N^{j/k}$-node $j$-dimensional mesh, $j > k$.

Proof: An $N^{j/k}$-node $j$-dimensional mesh can be embedded in an $N$-node $k$-dimensional mesh
with load $N^{(j-k)/k}$, congestion $N^{(j-k)/k}$, and dilation 1. □
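The proof states the parameters of the embedding but not the map itself. One natural candidate, offered here as an assumption rather than as the author's construction, is to give both meshes side length $m = N^{1/k}$ and project each guest node onto its first $k$ coordinates. The sketch below measures load, congestion, and dilation of that projection by brute force; `projection_embedding` is a hypothetical name.

```python
from itertools import product
from collections import Counter

def projection_embedding(m, j, k):
    """Embed the j-dimensional mesh of side m into the k-dimensional mesh of
    side m by keeping the first k coordinates. Returns (load, congestion,
    dilation) measured over all guest nodes and edges."""
    load = Counter()
    congestion = Counter()
    dilation = 0
    for x in product(range(m), repeat=j):
        hx = x[:k]
        load[hx] += 1
        for dim in range(j):
            if x[dim] + 1 < m:
                # Guest edge from x to its successor along dimension dim.
                y = x[:dim] + (x[dim] + 1,) + x[dim + 1:]
                hy = y[:k]
                dist = sum(abs(a - b) for a, b in zip(hx, hy))
                dilation = max(dilation, dist)
                if dist == 1:
                    congestion[(hx, hy)] += 1
    return max(load.values()), max(congestion.values()), dilation
```

With $m = 3$, $j = 3$, $k = 2$ the measured load and congestion are both $m^{j-k} = 3$ and the dilation is 1, matching the parameters in the proof with $N = m^k$.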

Observation 62 An $N_H = n^k$-node $k$-dimensional mesh $H$ can perform a work-preserving
emulation of an $N_G = n2^n$-node butterfly graph $G$.

Proof: An $n2^n$-node butterfly graph with $2^n$ rows and $n$ columns can be embedded in an
$N_H = n^k$-node $k$-dimensional mesh with load $O(2^n/n^{k-1})$, congestion $O(2^n/n^{k-1})$, and dilation
$O(n)$. □

It is interesting to note that every connected network can perform a real-time emulation of
a linear array. Hence, Observations 60 through 62 can be modified to hold for all connected
networks.


3.4 Emulations by complete binary trees

3.4.1  Work-preserving  emulations  of  bounded-degree  trees 

In this section, we show that any $N \log\log N$-node forest with maximum degree $\Delta$ can be embedded
in an $N$-node complete binary tree with load $O(\Delta \log\log N)$, congestion $O(\Delta^2 \log\log N)$,
and dilation $O(\log \Delta)$. As a corollary, there is a work-preserving emulation with slowdown
$O(\log\log N)$ of the class of bounded-degree forests by the class of complete binary trees.

In  constructing  the  embedding,  we  use  the  following  well-known  weighted-separator  lemma 
and  its  corollaries. 

Lemma 63 Suppose that $F = (V, E)$ is a forest where each vertex has been assigned some
non-negative weight. Then it is possible to remove a single vertex from $V$ so that the remaining
vertices can be partitioned into two subforests $F_1$ and $F_2$ such that no edge connects a vertex in
$F_1$ with a vertex in $F_2$, and $F_1$ and $F_2$ each contain at most 2/3 of the total weight.

Proof:  Omitted. 
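Since the proof is omitted, it may help to see one standard route to such a separator. The sketch below is an assumption about how the lemma can be realized, not the author's argument: for a single tree it removes a weighted centroid, so every remaining component holds at most half of the total weight, and then packs the components greedily into two sides. It handles a connected tree only; the name `weighted_separator` is hypothetical.

```python
def weighted_separator(adj, weight, root=0):
    """For a tree given as an adjacency dict, return (v, side1, side2) such
    that removing v leaves two edge-disjoint groups of vertices, each holding
    at most 2/3 of the total weight."""
    # Iterative DFS: record parents and a preorder.
    parent = {root: None}
    order, stack = [], [root]
    while stack:
        x = stack.pop()
        order.append(x)
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    # Subtree weights, accumulated from the leaves upward.
    sub = {x: weight[x] for x in order}
    for x in reversed(order):
        if parent[x] is not None:
            sub[parent[x]] += sub[x]
    total = sub[root]
    # Walk toward a heavy child until no component exceeds half the weight.
    v = root
    while True:
        heavy = [y for y in adj[v] if parent[y] == v and 2 * sub[y] > total]
        if not heavy:
            break
        v = heavy[0]
    # Collect the components that remain once v is removed.
    components = []
    for y in adj[v]:
        comp, seen, stack = [], {v, y}, [y]
        while stack:
            x = stack.pop()
            comp.append(x)
            for z in adj[x]:
                if z not in seen:
                    seen.add(z)
                    stack.append(z)
        components.append(comp)
    # Greedy packing: heaviest component first, always into the lighter side.
    sides, side_weight = ([], []), [0, 0]
    for comp in sorted(components, key=lambda c: -sum(weight[x] for x in c)):
        i = 0 if side_weight[0] <= side_weight[1] else 1
        sides[i].extend(comp)
        side_weight[i] += sum(weight[x] for x in comp)
    return v, sides[0], sides[1]
```

The greedy packing stays within 2/3 because a side can exceed that bound only if it consists of a single component heavier than 2/3 of the total, which the centroid choice rules out.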

Corollary 64 By removing a single vertex, it is possible to partition a forest $F = (V, E)$ into
two subforests each containing at most $2|V|/3$ vertices.

Proof:  Assign  each  vertex  weight  1  and  apply  Lemma  63.  □ 

Corollary 65 By removing a set $S$ of $k$ vertices, it is possible to partition a forest $F = (V, E)$
into two subforests, $F_1$ and $F_2$, each containing at most $|V|(1 + (2/3)^k)/2$ vertices.

Proof: Initially $F_1$ and $F_2$ are empty and a third set $R$ contains all of the vertices. Iterate
the following step $k$ times. Apply Corollary 64 to split $R$ into two subforests, then remove the
smaller subforest from $R$ and add it to the smaller of $F_1$ and $F_2$. At the end of each step, $F_1$
and $F_2$ differ in size by at most $|R|$. After $k$ iterations, $R$ contains at most $|V|(2/3)^k$ vertices.
Add $R$ to the smaller of the two sets. □

Corollary 66 Suppose that $F = (V, E)$ is a forest where each vertex has been assigned some
non-negative weight. Then it is possible to remove a set $S$ of $k$ vertices from $V$ such that the
remaining vertices can be partitioned into two subforests $F_1$ and $F_2$ such that no edge connects
a vertex in $F_1$ with a vertex in $F_2$, and each contains at most $|V|(1 + (2/3)^{\lfloor (k-1)/2 \rfloor})/2$ vertices
and at most 5/6 of the total weight.

Proof: First apply Lemma 63 to partition the forest into two subforests $L$ and $R$, each
containing at most 2/3 of the weight. Next, apply Corollary 65 to split $L$ into $L_1$ and $L_2$, and $R$
into $R_1$ and $R_2$. Let $L_1$ and $R_1$ have more weight than $L_2$ and $R_2$ respectively. Then both $L_1$
and $R_1$ have at most 2/3 of the weight, and $L_2$ and $R_2$ have at most 1/6. Let $F_1 = L_1 \cup R_2$
and $F_2 = L_2 \cup R_1$. □

With  these  tools  in  hand,  we  present  the  embedding. 

Theorem 67 An $N \log\log N$-node forest with maximum degree $\Delta$ can be embedded in an
$N$-node complete binary tree with load $l = O(\Delta \log\log N)$, congestion $c = O(\Delta^2 \log\log N)$, and
dilation $d = O(\log \Delta)$.

Proof: The embedding begins by using Corollary 66 to find a set $S$ of $k \in O(\log\log N)$ nodes
that partitions the forest $F = (V, E)$ into two subforests, each containing at most
$|V|(1 + 1/\log N)/2$ vertices. We embed $S$ at the root of the binary tree and then recursively embed
one of the subforests in the left subtree of the root, and the other in the right.

At levels below the root, we use Corollary 66 to simultaneously partition the vertices of the
forest and the edges connecting the forest to vertices that are embedded higher in the binary
tree. Let $F_i = (V_i, E_i)$ be a forest to be embedded in a subtree rooted at a level $i$ node $v_i$ in the
binary tree. Let $N_i$ be the number of edges connecting $F_i$ to vertices embedded higher in the
binary tree; $N_i$ is the congestion of the binary tree edge connecting $v_i$ to its parent. We assign
each vertex of $F_i$ a weight equal to the number of neighbors it has that are embedded higher
in the binary tree. Using Corollary 66, we find a set $S_i$ of $k$ vertices that partitions $F_i$ into two
subforests, each of size at most $|V_i|(1 + 1/\log N)/2$, and each having at most $(5/6)N_i$ edges to
vertices that are embedded higher in the tree. We embed the vertices of $S_i$ at $v_i$ and recursively
embed one of the subforests in the left subtree of $v_i$, and the other in the right subtree.

To limit the dilation to some integer $d$, whenever $i$ is a multiple of $d$ we embed at $v_i$ not
only $S_i$ but also all of the vertices in $F_i$ that have at least one neighbor embedded somewhere
higher in the binary tree.

We must now show how to choose $d$ so that both the congestion and the load of the embedding
are small. Consider any simple path from a level $i$ node $v_i$ in the binary tree to a level $i + d$
node, $v_{i+d}$, where $i$ is a multiple of $d$. At level $i$, we embed a separator of size $k$ and at most $N_i$
other vertices that have at least one neighbor embedded higher in the tree. Since each of these
vertices has at most $\Delta$ neighbors, $N_{i+1} \le \Delta k + \Delta N_i$. At level $i + 1$, we embed a separator of
size $k$ that partitions $F_{i+1}$ into two subforests, each having at most $(5/6)N_{i+1}$ edges to vertices
embedded higher in the binary tree. Thus, at level $i + 2$, we have $N_{i+2} \le (5/6)N_{i+1} + \Delta k$. In
general, $N_{i+j}$ is given by the recurrence

\[ N_{i+j} \le \begin{cases} \Delta k + \Delta N_i & j = 1 \\ (5/6)N_{i+j-1} + \Delta k & 1 < j \le d. \end{cases} \]

Solving the recurrence yields

\[ N_{i+j} \le 6\Delta k + (5/6)^{j-1} \Delta N_i. \]
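The solution of the congestion recurrence can be checked numerically. The sketch below assumes the recurrence reads $N_{i+1} \le \Delta k + \Delta N_i$ and $N_{i+j} \le (5/6)N_{i+j-1} + \Delta k$ for $1 < j \le d$, with closed form $N_{i+j} \le 6\Delta k + (5/6)^{j-1}\Delta N_i$; it iterates the recurrence with equality, which is the worst case. The function names are illustrative.

```python
def congestion_profile(delta, k, n_i, d):
    """Iterate the recurrence with equality and return [N_{i+1}, ..., N_{i+d}]."""
    values = [delta * k + delta * n_i]
    for _ in range(2, d + 1):
        values.append(5 * values[-1] / 6 + delta * k)
    return values

def closed_form(delta, k, n_i, j):
    """Claimed upper bound on N_{i+j}."""
    return 6 * delta * k + (5 / 6) ** (j - 1) * delta * n_i
```

Starting from $N_i = 12\Delta k$, the bound falls back below $N_i$ once $(5/6)^{d-1} \le 1/(2\Delta)$, that is, after $d = O(\log \Delta)$ levels.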

We are now in a position to calculate the load and the congestion. The preceding argument
shows that for $d \in O(\log \Delta)$ and $N_i \in O(\Delta k)$, we have $N_{i+d} \le N_i$. Thus, in every simple
path between a node at level $i$ and a node at level $i + d$, where $i$ is a multiple of $d$, the
congestion starts at $O(\Delta k)$ at level $i$, rises to at most $O(\Delta^2 k)$ at level $i + 1$ and proceeds to
drop back down to at most $O(\Delta k)$ at level $i + d$. Thus, the congestion of the embedding is at
most $O(\Delta^2 \log\log N)$. How large can the load be? At each node of the binary tree we embed
a separator of size $k$. For every $i$ that is a multiple of $d$, we also embed a set of nodes of size
$N_i = O(\Delta k)$. Finally, at the leaves we embed forests of size

\[ N \log\log N \left( \frac{1 + 1/\log N}{2} \right)^{\log N}, \]

which is at most $O(\log\log N)$. Thus the load is at most $O(\Delta \log\log N)$. □

Corollary 68 There is a work-preserving emulation of the class of bounded-degree forests by
the class of complete binary trees with slowdown $O(\log\log N)$.


3.4.2  Congestion  lower  bound  for  complete  ternary  trees 

In this section we show that any embedding of an $N$-leaf complete ternary tree $T_3$ in an $M$-leaf
complete binary tree $T_2$, $N \le M \le 3N$, in which the leaves of $T_3$ are mapped to the leaves of
$T_2$ with load at most $2^{\log^\alpha N}$, fixed $\alpha < 1$, has congestion at least $\Omega(\sqrt{\log\log N})$. This lower
bound suggests, but does not prove, that real-time emulation of a complete ternary tree by a
complete binary tree is impossible.

Theorem 69 Any embedding of an $N$-leaf complete ternary tree $T_3$ in an $M$-leaf complete
binary tree $T_2$, $N \le M \le 3N$, in which the leaves of $T_3$ are mapped to the leaves of $T_2$ with
load $l = 2^{\log^\alpha N}$, fixed $\alpha < 1$, has congestion at least $\Omega(\sqrt{\log\log N})$.

Proof: The proof has the following outline. Let $L$ denote the number of leaves of $T_3$ in a
subset $S$ of the nodes of $T_3$, and let $w$ be a base-3 string representing $L$. First we show that
for any $S$, the number of 1's in $w$ is at most one plus the number of edges between $S$ and $\bar{S}$.
As a consequence, if $S$ is the set of nodes mapped to a subtree rooted at a node $v$ in $T_2$, then
the congestion on the edge from $v$ to its parent is at least as large as the number of 1's in
$w$. Next, we construct a path $v_0, v_1, \ldots, v_{\log M}$ in $T_2$ from the root $v_0$ to a leaf $v_{\log M}$ such that
there is a long sequence of nodes on the path, $v_j, v_{j+1}, \ldots, v_{j+s-1}$ such that for each $v_i$, where
$j \le i \le j + s - 1$, the number of leaves of $T_3$ mapped to the left and right subtrees of $v_i$ are
nearly equal. Let $S_i$ be the set of nodes of $T_3$ mapped to the subtree rooted at $v_i$, let $L_i$ be the
number of leaves of $T_3$ in $S_i$, and let $w_i$ be the base-3 string representing $L_i$. To complete the
proof we show that for some $i$, where $j \le i \le j + s - 1$, there are many 1's in $w_i$.

First we show that for any subset $S$ of the nodes of $T_3$, the number of 1's in $w$ is at most
$|E_S| + 1$, where $E_S$ is the set of edges in $T_3$ connecting a node in $S$ to a node in $\bar{S}$. The key
idea is that $L$ can be expressed as a series of $|E_S| + 1$ terms, both positive and negative, where
each term is a perfect power of 3. If the root of $T_3$ belongs to $S$, then the series begins with
the term $N$; otherwise it begins with 0. Thereafter, each edge in $E_S$ contributes a term to the
series. An edge between a node $u$ on level $l$ and its parent on level $l - 1$ contributes $N/3^l$ if $u$ is
in $S$, and $-N/3^l$ otherwise. Because adding or subtracting a power of 3 can produce at most
one 1 digit in a base-3 number, the number of 1's in $w$ is at most $|E_S| + 1$.
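The digit-counting claim is easy to test exhaustively on small trees. The sketch below is illustrative only: it stores a complete ternary tree in heap order (node $v$ has children $3v+1$, $3v+2$, $3v+3$, an indexing chosen here for convenience), enumerates every subset $S$ of the nodes, and checks that the number of 1 digits in the base-3 representation of the leaf count never exceeds $|E_S| + 1$.

```python
def ones_in_base3(n):
    """Count the digits equal to 1 in the base-3 representation of n."""
    count = 0
    while n > 0:
        if n % 3 == 1:
            count += 1
        n //= 3
    return count

def check_all_subsets(height):
    """Verify ones(L) <= |E_S| + 1 for every node subset S of the complete
    ternary tree with 3**height leaves. Returns True if no subset fails."""
    num_nodes = (3 ** (height + 1) - 1) // 2
    first_leaf = (3 ** height - 1) // 2
    for mask in range(1 << num_nodes):
        in_s = [(mask >> v) & 1 for v in range(num_nodes)]
        leaves = sum(in_s[first_leaf:])
        # An edge is cut when a node and its parent lie on opposite sides.
        cut = sum(1 for v in range(1, num_nodes) if in_s[v] != in_s[(v - 1) // 3])
        if ones_in_base3(leaves) > cut + 1:
            return False
    return True
```

Height 2 has 13 nodes and 8192 subsets, so the check runs instantly; larger heights grow too quickly for brute force.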

Starting at the root, $v_0$, we construct the path in $T_2$ according to the following rule. Suppose
that $v_i$ is a node on the path. Then the next node on the path, $v_{i+1}$, is the root of the left or
right subtree of $v_i$ containing more leaves of $T_3$. Let $L_i$ be the number of leaves of $T_3$ mapped
to the subtree rooted at $v_i$. Then the subtree rooted at $v_{i+1}$ contains at least $L_i/2$ leaves of $T_3$.
We call the split at $v_i$ fair if both of its subtrees contain at most $L_i(1/2 + \varepsilon)$ leaves of $T_3$,
where $\varepsilon$ will be specified later.

Next we put a lower bound on the length of the longest sequence of consecutive fair splits.
Let $h$ be the number of unfair splits on the path. The number of leaves of $T_3$ mapped to the
leaf at the end of the path is at least

\[ \frac{N}{M}(1 + 2\varepsilon)^h. \]

Since the load is at most $l$, and $1 + x \ge e^{x/2}$ for $0 \le x \le 1$, we have $l \ge \frac{1}{3}e^{\varepsilon h}$. Let $s$ be the
length of the longest sequence of consecutive fair splits. Then $s \ge \log M/h \ge \varepsilon \log M/\ln 3l$.

We now show that in the longest sequence of consecutive fair splits $v_j, \ldots, v_{j+s-1}$,
there must be a node $v_i$, where $j \le i \le j + s - 1$, such that there are many 1's in $w_i$. For the
moment, let us assume that at each node $v_i$ on the sequence, the number of leaves of $T_3$ mapped
to each subtree of $v_i$ is exactly $L_i/2$. Then we can prove that at some node $v_i$ on the sequence,
the number of 1's in the $t$ most significant digits of $w_i$ is at least $\sqrt{t}$, where $t = (\log_3 2)s$.
Suppose that the number of 1's in $w_j$ is smaller than $\sqrt{t}$ (otherwise we're done). The 1's
in $w_j$ partition it into substrings consisting of 0's and 2's only. In each substring, division by 2
either converts all of the 0's to 1's (leaving the 2's unchanged), or converts all of the 2's to 1's
(leaving the 0's unchanged). Thus, after division by 2, 0's and 2's are adjacent in at most $\sqrt{t}$
places in $w_{j+1}$. Thus, there must be a substring of either $\sqrt{t}$ 0's or $\sqrt{t}$ 2's in $w_{j+1}$. In either
case, after at most $s$ divisions by 2 the substring is converted to all 1's.
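The effect of halving on a base-3 string can be seen on two small numbers. The helper below is a minimal sketch (the name `to_base3` and the fixed-width formatting are choices made here for illustration); halving is integer division, so a remainder of 1 is discarded when the number is odd.

```python
def to_base3(n, width):
    """Return the base-3 representation of n, zero-padded to `width` digits."""
    digits = []
    for _ in range(width):
        digits.append(str(n % 3))
        n //= 3
    return "".join(reversed(digits))
```

For example, 60 is 2020 in base 3 and 30 is 1010: every 2 became a 1 and the 0's were untouched. In the other direction, 35 is 1022 and 17 is 0122: the leading 1 leaves a remainder, which turns the 0 after it into a 1 and leaves the 2's unchanged.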

Unfortunately, a fair split at a node $v_i$ does not divide $L_i$ exactly by 2; it also adds as much
as $\varepsilon L_i$. For $\varepsilon \le 1/3^t$, adding $\varepsilon L_i$ does not change the $t$ most significant digits unless a carry
propagates in. We need to show that our substring of $\sqrt{t}$ 0's or 2's is not adversely affected
by carries. Since a carry into a substring of 2's turns them all into 0's, we need only consider
the effect of a carry into a substring of 0's. A carry into a substring of 0's converts the least
significant 0 in the substring into a 1, which is bad, because it reduces the length of the string.
However, $3^{\sqrt{t}/2}$ carries are required to modify the $\sqrt{t}/2$ least significant 0's in the substring.
Since at most one carry occurs at each of the $s$ splits, and $s < 3^{\sqrt{t}/2}$, the length of the longest
string of 0's never drops below $\sqrt{t}/2$.

To finish, we choose values for $\varepsilon$, $l$, and $t$. To make the lower bound strong, we want to
make $l$ large without making $t$ too small. For any fixed $\alpha < 1$, we can choose $l = 2^{\log^\alpha N}$,
$t = \Theta(\log\log N)$, and $\varepsilon = 1/3^t$. The congestion is at least $\sqrt{t}/2 = \Omega(\sqrt{\log\log N})$. □

3.5  Emulations  by  butterfly  networks 

3.5.1  Work-preserving  emulations  of  binary  trees 

When the Bhatt, Chung, Hong, Leighton, Rosenberg result [11] that a butterfly can emulate a
complete binary tree in real-time is combined with the material in Section 3.4, we find that there
is an $O(\log\log N)$-time work-preserving simulation of the class of binary trees on the butterfly.
Whether or not this emulation can be performed in real-time remains an open question.

3.5.2  Emulation  of  meshes 

In this section we show that an $O(N)$-node butterfly can emulate an $N$-node mesh with slowdown
$O(\log\log N)$.

Theorem 70 An $O(N)$-node butterfly can emulate $T$ steps of a $\sqrt{N} \times \sqrt{N}$ mesh in
$O(T \log\log N)$ steps.

Proof: The trick is to divide the mesh into slightly overlapping submeshes, as shown in Figure 3-1.
Each $\log^2 N \times \log^2 N$ submesh overlaps its neighbors in either $2 \log N$ rows or $2 \log N$ columns.
Since the submeshes overlap, some mesh nodes appear in as many as four submeshes. We call
two nodes in neighboring submeshes mates if they correspond to the same mesh node. Each
submesh is emulated by a different $\Theta(\log^4 N)$-node subbutterfly. Since a single mesh node may
be emulated by several subbutterflies, the butterfly performs redundant computation.

A subbutterfly emulates the corresponding submesh by routing packets between each mesh
node and its neighbors. Since a $\Theta(\log^4 N)$-node subbutterfly can route any permutation
of $O(\log^4 N)$ packets in $O(\log\log N)$ steps, the time to emulate each step of the submesh is
$O(\log\log N)$.

The nodes on the borders of a submesh cannot be emulated by the corresponding subbutterfly
because they require inputs from mesh neighbors that the subbutterfly does not emulate.
As a consequence, nodes at distance $\delta$ from the border can be emulated for only $\delta$ steps.
Fortunately, every node at distance $\delta \le \log N$ from the border of one submesh has a mate at
a distance of $2 \log N - \delta \ge \log N$ in a neighboring submesh. Thus, every mesh node can be
emulated for the full $\log N$ steps in some subbutterfly.

Figure 3-1: The division of the mesh into submeshes. Each $\log^2 N \times \log^2 N$ submesh overlaps
its neighbors in either $2 \log N$ rows or $2 \log N$ columns.

To emulate $T \ge \log N$ steps of the mesh, the $T$ steps are broken into blocks of $\log N$
consecutive steps. The time to emulate a block of $\log N$ steps is $O(\log N \log\log N)$. Before the
next block can be emulated, the nodes within distance $\log N$ of the borders of the submeshes
must be updated by their mates. Since an $N$-node butterfly can route any permutation of
$N$ packets in $O(\log N)$ steps, the updating takes $O(\log N)$ time. The total time for $T$ steps is
$O(T \log\log N)$. □
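The total follows by charging each block of $\log N$ mesh steps for its emulation and for the update that precedes the next block:

```latex
\[
\frac{T}{\log N}\,\Bigl( O(\log N \log\log N) + O(\log N) \Bigr) \;=\; O(T \log\log N).
\]
```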

This emulation scheme has two main drawbacks. First, the packets that are sent between
blocks to update the mates must each carry enough information to reflect the change in the
state of a mesh node over a period of $\log N$ steps. Such packets are unreasonably large. This
problem can be overcome by observing that only $O(N/\log N)$ of the mesh nodes must be
updated. If these nodes are carefully positioned within their subbutterflies, it is possible to
route $\log N$ packets to each of them in $O(\log N)$ steps. Second, the slowdown is too large. The
slowdown can be reduced from $O(\log\log N)$ to $O(\log^* N)$ by emulating each $\log^2 N \times \log^2 N$
mesh recursively. A more sophisticated scheme for real-time emulation is presented in [40].



3.5.3 Embedding the shuffle-exchange graph in the butterfly

In this section, we show how to embed an N-node shuffle-exchange graph in an O(N)-node
butterfly graph with constant load, constant congestion, and O(log N) dilation. These graphs
are defined in Sections 1.7 and 1.5, respectively.

A constant-congestion embedding requires that very few edges of the shuffle-exchange graph
be mapped to long (more than constant length) paths in the butterfly. In addition, these paths
must not overlap each other very often. To ensure this, we use Waksman's observation [98]
that the inputs and outputs of a Beneš network can be connected in any permutation by a set
of disjoint paths. That is, if the set of long paths can be decomposed into a constant number
of (partial) permutations of the inputs of the butterfly, the long paths can be embedded with
constant congestion. It is easy to see that we can embed the long paths in this manner when
there are at most a constant number of endpoints of long paths in any single butterfly row. (We
first route a path from each endpoint to the input of its row, which leaves us with a constant
number of permutations to route on the Beneš network.)

We  map  the  nodes  of  a  shuffle-exchange  graph  to  the  nodes  of  a  butterfly  graph  so  that 

1. at most a constant number of shuffle-exchange nodes are mapped to any one butterfly
node, and

2.  each  butterfly  row  contains  at  most  a  constant  number  of  shuffle-exchange  nodes  which 
have  any  neighbor  mapped  to  a  distant  node  in  the  butterfly. 

Short paths contribute only constant congestion since they have constant length. Long
paths contribute only constant congestion since we can route any permutation with congestion
2, and we only need to route a constant number of (partial) permutations. Also, the length of
the short paths is constant and that of the long paths is O(log n).

In particular, we map the nodes of an N = 2^n-node shuffle-exchange graph to the nodes of a
(n + 2 − log n) 2^{n+2−log n} ≈ 4N-node butterfly graph. Each node in this N-node shuffle-exchange
graph has n bits in its label. A node in the butterfly can be specified by a row represented by
n + 2 − log n bits, and a level in the row. The level in the row corresponds to a bit that can be
flipped to enter another row. Thus, we first associate a shuffle-exchange node with a particular
row of the butterfly by removing log n − 1 adjacent bits of its label, none of which is the least
significant bit. Then we pick the level in the row which corresponds to where the least significant
bit of the shuffle-exchange node appears in the row's representation.

We map a shuffle-exchange node w to a node in the butterfly as follows.

1. Consider the longest string of zeros in w ignoring the least significant bit; break ties by
choosing the first one from the left.

2. Pick out log n − 1 bits as follows:

(a) If possible, choose the log n − 1 bits after the zeros and before the lsb;

(b) otherwise, if possible, choose the log n − 1 bits preceding the longest string of zeros;

(c) otherwise choose the last log n − 1 bits of the string of zeros (note that in this case
more than n − 2 log n bits are zeros).

3. Treat these bits as a number (with log n − 1 bits it will be in the range 0 ... n/2 − 1),
call this number s, and the sequence of bits a_s.

4. Remove the bits of s from w, extend the chosen string of zeros on the right (left) by a 01
(10) if the bits were removed from the right (left) of the block of zeros, and cyclic shift
the resulting string so that s bits appear after the longest string of zeros. This specifies
the row.

Symbolically, we map w = z 0^k a_s y b to row u 0^{k+1} 1 v, or we map w = z a_s 0^k y b to row
u 1 0^{k+1} v, with ybz = vu and |u| = s. (Note that we map to a row with a unique longest string
of zeros not straddling the bit which is at the level of the butterfly node.) It is easy to see that
the least significant bit of w, b, is somewhere in the representation of the row. We choose the level
in the row to correspond to the position of b in the row's representation.
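
Step 1 of the mapping, locating the longest string of zeros while ignoring the least significant bit, can be sketched as follows. This is an illustration only: it assumes labels are written as strings with the most significant bit first and the least significant bit last, and that a run of zeros does not wrap around the ends of the label. The function name is hypothetical.

```python
def longest_zero_run(w):
    """Return (start, length) of the longest run of '0's in w without its last bit.

    Ties are broken by choosing the leftmost run, as in step 1 of the mapping.
    """
    body = w[:-1]                      # ignore the least significant (last) bit
    best_start, best_len = 0, 0
    i = 0
    while i < len(body):
        if body[i] == "0":
            j = i
            while j < len(body) and body[j] == "0":
                j += 1
            if j - i > best_len:       # strict comparison keeps the leftmost run on ties
                best_start, best_len = i, j - i
            i = j
        else:
            i += 1
    return best_start, best_len

print(longest_zero_run("10010010"))    # two runs of length 2; the left one is chosen
```

The remaining steps (choosing the log n − 1 bits and forming the row) depend on where this run sits relative to the least significant bit, which is why the run is located first.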

We must argue that the mapping achieves conditions 1 and 2 above.

First, we introduce some more notation. We define a necklace to be a set of shuffle-exchange
nodes which are connected only by shuffle edges. Alternatively, a necklace is a set of nodes
having labels which are cyclic shifts of each other. A necklace's label is the lexicographically
minimum label of its nodes. We can specify a shuffle-exchange node by the label of its necklace
and the position of the least significant bit of the node's label in the necklace's label.

We define the domain of a butterfly node to be the set of shuffle-exchange nodes that are
mapped to it by our mapping.
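
The necklace notation can be made concrete with a small sketch. It assumes labels are bit strings with the least significant bit last; the helper names are hypothetical and are not taken from the original text.

```python
def rotations(label):
    """All cyclic shifts of a label; rotations(label)[t] is the left shift by t."""
    return [label[t:] + label[:t] for t in range(len(label))]

def necklace_label(label):
    """The lexicographically minimum label among the nodes of the necklace."""
    return min(rotations(label))

def necklace_coordinates(label):
    """Specify a node by (necklace label, position of the node's lsb in that label)."""
    neck = necklace_label(label)
    t = rotations(neck).index(label)       # smallest left shift of neck that gives label
    return neck, (t - 1) % len(label)      # index in neck of the bit that ends up last

def node_from_coordinates(neck, lsb_pos):
    """Inverse of necklace_coordinates."""
    t = (lsb_pos + 1) % len(neck)
    return neck[t:] + neck[:t]

neck, pos = necklace_coordinates("0110")
print(neck, pos)
print(node_from_coordinates(neck, pos))
```

The pair (necklace label, position) is the representation used in the counting arguments that follow: a node is pinned down once its necklace and the location of its least significant bit are known.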

Now we show that the mapping is at most two to one. That is, given a butterfly node (p, r)
we can describe at most two shuffle-exchange nodes that could possibly be mapped to (p, r)
as follows. Recall that a butterfly node (p, r) has all the bits of w in r's binary representation
except for a_s. These we recover by finding the length of the string after the longest group of
zeros in r's binary representation not straddling the pth bit. We know that we have to reinsert
them either directly before or directly after that group of zeros. This gives us all the bits of
the domain nodes except for a cyclic shift uncertainty. Thus, the domain of (p, r) can only be
nodes from two necklaces. Furthermore, the least significant bit of the nodes' labels is uniquely
specified by the place where the pth bit of r's binary representation occurs in the necklaces'
labels. Thus only two shuffle-exchange nodes can be mapped to any node in the butterfly.

Finally, we argue that we map at most a constant number of shuffle-exchange nodes with
distant neighbors to any butterfly row.

Notice that we always ignore the value of the least significant bit in the mapping of
shuffle-exchange nodes to butterfly nodes. Thus the mapping maps two shuffle-exchange nodes
joined by an exchange edge to two nodes that differ only in the bit that can currently be
changed by a butterfly edge. Thus, any exchange edge needs only to flip the bit at the node's
level, which only requires a path of length 2. Thus all exchange edges are embedded in short paths.

Now consider the shuffle edges. We show that at most a constant number of shuffle edges
leave any row of the butterfly. (It is easy to see that all the shuffle edges in a row are mapped
to single edges in the butterfly graph.) Again, consider the inverse mapping of a butterfly node,
(p, r), to two shuffle-exchange nodes. The necklaces of the domain nodes of row r's nodes are
the same for most of the row. They change only at certain transition levels in the row: levels
p in the row where the position of the longest string of zeros not straddling p changes, or levels
in the row where we become unsure or sure of which side of the zeros to replace the removed
bits, a_s.

The position of the longest string of zeros not straddling p changes at only two points, both inside
the row's unique longest string of zeros. When the row level is within log n bit positions to the
right of the longest string of zeros, we know that pieces of two shuffle-exchange necklaces could
have been mapped to the row. Outside this range we know that only one necklace is mapped to
the row: inside the group of zeros the bits were definitely taken out before the group of zeros,
and further to the right they were definitely taken out after the group of zeros. Thus entering
this stretch and leaving this stretch gives us two more bad levels. Thus we have four transition
levels in all, and at most four necklaces could enter or leave the row at any
of these levels. Thus at most 16 long shuffle edges can have endpoints in this row. (Careful
counting can reduce this number to 6.)

Thus at most 16 long edges are adjacent to any row of the butterfly. This satisfies condition
2, above.

Thus,  the  shuffle-exchange  graph  can  be  embedded  in  the  butterfly  with  constant  congestion. 

3.5.4  Layouts  for  the  shuffle-exchange  graph  with  optimal  area  and  volume 

The N-node butterfly can be laid out in O(N^2 / log^2 N) area (trivially) and in O(N^{3/2} / log^{3/2} N)
volume [100]. Since the N-node shuffle-exchange graph can be embedded in the N-node butterfly
with constant congestion, we can simply blow up these layouts by a constant factor to
obtain layouts for the shuffle-exchange graph with equivalent area and volume.

3.5.5  A  work  preserving  emulation  of  a  shuffle-exchange  graph 

We construct an O(log N)-step work-preserving simulation of the shuffle-exchange graph on
the butterfly by first embedding the shuffle-exchange graph in an N log N-node butterfly with
constant congestion, and then embedding the N log N-node butterfly in an N-node butterfly
in the natural way. It is not difficult to show that the N-node butterfly can then simulate the
N lg N-node shuffle-exchange in O(log N) steps. Whether or not there is a real-time emulation
remains an interesting open question.


3.6  Emulations  by  shuffle-exchange  graphs 

3.6.1  Work  preserving  emulations  of  arbitrary  binary  trees 

It is well known that the shuffle-exchange graph can emulate a complete binary tree in real
time. Thus by the results of Section 3.4, we know that there is an O(log log N)-time
work-preserving emulation of the class of binary trees on the shuffle-exchange graph. Whether or not
this emulation can be made real-time remains an open question.

3.6.2  Embedding  little  butterflies  in  the  shuffle-exchange  graph 

In this section we show how to embed M/log M distinct M log M-node butterfly graphs in an
N = M^2 shuffle-exchange graph with load l = 2, congestion c = O(1), and dilation d = 3. A
similar result was proved by Raghunathan and Saran [80]. We assume that M = 2^k. Thus each
row of the butterfly can be represented by a k-bit string, and each node of the shuffle-exchange
can be represented by a 2k-bit string.

To map M/log M butterflies to the shuffle-exchange graph, we use the following easily
proven lemma.

Lemma 71 The set of k = log M-bit strings has at least M/(2 log M) disjoint subsets each
containing log M distinct strings which are cyclic shifts of each other.
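
For small k the lemma can be checked by brute force. The sketch below is only a sanity check under the reading that the subsets are classes of k-bit strings whose k cyclic shifts are all distinct; it is not a proof, and the function name is hypothetical.

```python
from itertools import product

def full_necklaces(k):
    """Count classes of k-bit strings whose k cyclic shifts are all distinct."""
    seen, count = set(), 0
    for bits in product("01", repeat=k):
        s = "".join(bits)
        if s in seen:
            continue
        shifts = {s[t:] + s[:t] for t in range(k)}   # the whole cyclic-shift class of s
        seen |= shifts
        if len(shifts) == k:                         # no shift coincides with another
            count += 1
    return count

for k in range(2, 11):
    m = 2 ** k
    # Lemma 71 claims at least M / (2 log M) such subsets, with M = 2^k and log M = k.
    print(k, full_necklaces(k), m / (2 * k))
```

In these small cases the count comfortably exceeds the bound, since only the periodic strings fail to have k distinct shifts.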

For each of these subsets we pick the lexicographically minimum string to represent the
subset. We associate the M/log M butterflies two to one with the M/(2 log M) representative
strings. Say butterfly i is associated with string w^i. We map a node (p, r) in butterfly i to a
shuffle-exchange node by shuffling the bits of w^i with the bits of r's representation, and choosing
the current bit to be under the image of r_p. That is, node (p, r) in butterfly i is mapped to
shuffle-exchange node r_1 w^i_1 ... r_p w^i_p ... r_n w^i_n, with r_p as the current bit.

From a shuffle-exchange node we can recover the representative string w^i by picking out
every other bit and shifting to the lexicographically minimum string. We find the row string by
picking out the other bits and shifting by the same amount. The position in the row is clearly
given by the number of shifts we used to get to w^i and the row number.

To finish, we observe that each edge in any of the butterflies is mapped to a path of length
at most three in the shuffle-exchange graph since we either shift twice to reach (p + 1, r)'s image,
or we exchange the current bit and shift twice to reach (p + 1, r_1 ... r̄_p ... r_n)'s image.
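
The interleaving map, its inverse, and the images of the butterfly edges can be sketched as follows. The conventions here are assumptions made for illustration: levels and bit positions are 0-indexed, the butterfly wraps around from the last level to the first, the current bit of a shuffle-exchange node is its last bit, a shuffle is a cyclic left shift by one position, and the representative string has k distinct cyclic shifts. All function names are hypothetical.

```python
def rotl(s, t):
    """Cyclic left shift of string s by t positions."""
    t %= len(s)
    return s[t:] + s[:t]

def embed(p, r, w):
    """Image of butterfly node (level p, row r) for representative string w."""
    interleaved = "".join(w[j] + r[j] for j in range(len(r)))   # w_0 r_0 w_1 r_1 ...
    return rotl(interleaved, 2 * (p + 1))                       # r[p] becomes the last bit

def recover(node):
    """Inverse of embed: return (p, r, w)."""
    w_bits, r_bits = node[0::2], node[1::2]      # every other bit
    k = len(w_bits)
    t = min(range(k), key=lambda s: rotl(w_bits, s))   # shift to the minimum string
    return (k - 1 - t) % k, rotl(r_bits, t), rotl(w_bits, t)

def shuffle(node):
    return rotl(node, 1)

def exchange(node):
    return node[:-1] + ("1" if node[-1] == "0" else "0")

node = embed(1, "1010", "0011")
print(node, recover(node))
# Straight edge: two shuffles. Cross edge: one exchange, then two shuffles.
print(shuffle(shuffle(node)) == embed(2, "1010", "0011"))
print(shuffle(shuffle(exchange(node))) == embed(2, "1110", "0011"))
```

Under these conventions a straight edge costs two shuffle edges and a cross edge costs one exchange edge followed by two shuffle edges, which matches the dilation of three stated in the text.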

Thus we can embed √N / log √N (√N log √N)-node butterflies in an N-node shuffle-exchange
with load 2, congestion O(1), and dilation 3.

This technique can be extended to prove that for any constant 0 < ε < 1, N^ε distinct N^{1−ε}-node
butterfly graphs can be embedded in an N-node shuffle-exchange.

3.6.3 Application to sorting on a shuffle-exchange graph

It is known that an N-node butterfly can sort N packets with high probability in O(log N)
steps  [53,  76,  S  t].  The  result  does  not  directly  extend  to  the  shuffle-exchange  graph  because  the 
shuffle-exchange  graph  does  not  have  the  nice  recursive  structure  possessed  by  the  butterfly. 
However, by combining the embedding result of Section 3.6.2, the butterfly sorting algorithm
in [63], and the columnsort algorithm of [47], we can obtain an algorithm for sorting N packets
on an N-node shuffle-exchange in O(log N) steps with high probability.

3.6.4  Real  time  emulations  of  arrays 

By combining a single level of the kind of analysis in Section 3.5.3 with the result of Section 3.6.2,
we can emulate an array in real time on a shuffle-exchange graph. This is despite the fact that
any O(1)-to-1 embedding of an N-node array (with dimension 2 or more) in a shuffle-exchange
graph has dilation Ω(log log N) [11].

3.6.5  A  work  preserving  emulation  of  the  butterfly 

By using standard techniques in routing normal hypercube algorithms, it is easily shown that
there is an O(log N)-step work-preserving simulation of a butterfly on a shuffle-exchange graph.
Whether or not there is a real-time simulation remains an important open question.


Minimum-cost spanning tree


4.1  Introduction 

In this chapter we show that minimum-cost spanning tree is a special case of the closed semiring
path-finding problem [1, Sections 5.6-5.9]. For a graph of n vertices, the path-finding problem
can be solved sequentially in O(n^3) steps by a dynamic programming algorithm [37, 66] of which
the algorithms of Floyd [25] and Warshall [99] are special cases. This dynamic programming
algorithm has a well known O(n)-step implementation on an n × n mesh-connected computer
[6, 19, 22, 30, 86].

Previously  known  minimum-cost  spanning  tree  algorithms  for  the  mesh  [6,  61]  are  based 
on  the  recursive  algorithm  of  Boruvka  (also  attributed  to  Sollin)  [91,  pp.  71-83],  which  is 
complicated  to  implement.  For  example,  the  algorithm  of  [6]  achieves  O(n)  steps  by  reducing 
the  fraction  of  the  mesh  in  use  by  a  constant  factor  at  each  recursive  call.  The  dynamic 
programming  algorithm  has  the  same  asymptotic  running  time  but  is  much  simpler. 

The  rest  of  this  chapter  consists  of  two  short  sections.  In  Section  4.2  we  show  how  to  cast 
minimum-cost  spanning  tree  as  a  path-finding  problem.  In  Section  4.3,  we  briefly  describe  an 
O(n)-step mesh algorithm to solve the problem.


This chapter describes joint research with Serge Plotkin [65].


4.2 Reduction to a path-finding problem

In  this  section  we  define  the  minimum-cost  spanning  tree  problem  and  a  related  path-finding 
problem.  We  give  a  recurrence  for  solving  the  path-finding  problem  via  dynamic  programming. 
We  then  prove  that  the  solution  to  the  path-finding  problem  contains  the  solution  to  the 
minimum-cost  spanning  tree  problem. 

Given an n-node connected¹ undirected graph G = (V, E), where V is the set {1, ..., n},
and where each edge {i, j} in E has cost C^0_{ij} = C^0_{ji}, the minimum-cost spanning tree problem is
to find a subgraph that connects the vertices in V such that the sum of the costs of the edges in
the subgraph is minimum. We assume that the edge costs are unique. (If not, lexicographical
information can be added to make them unique.) For convenience, we also assume that if {i, j}
is not in E then it has cost C^0_{ij} = C^0_{ji} = ∞.

The path-finding problem is to compute the cost C^k_{ij} for each 1 ≤ i, j, k ≤ n of the shortest
(lowest-cost) path from i to j that passes through vertices only in the set {1, ..., k}, where the
cost of a path is defined to be the highest cost of any edge on the path. For any i and j, the
shortest path from i to j with no intermediate vertex higher than k either passes through k or
does not. In the first case, the cost of the shortest path from i to j is either the cost of the
shortest path from i to k or the cost of the shortest path from k to j, whichever is higher. In
the second case, we have C^k_{ij} = C^{k-1}_{ij}. Thus, C^k_{ij} can be computed by the recurrence

C^k_{ij} = min{ C^{k-1}_{ij}, max{ C^{k-1}_{ik}, C^{k-1}_{kj} } }.
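
The dynamic program described above can be sketched sequentially as follows. This is a minimal illustration, not the mesh algorithm: it assumes vertices are numbered from 0, that the input is a symmetric matrix with math.inf for missing edges, and the function name and the example graph are hypothetical.

```python
import math

def bottleneck_costs(c0):
    """Return C^n from C^0, where a path costs as much as its most expensive edge."""
    n = len(c0)
    c = [row[:] for row in c0]
    for k in range(n):
        # Build C^k from C^{k-1}: either avoid vertex k, or go i -> k -> j and
        # pay the higher of the two halves.
        c = [[min(c[i][j], max(c[i][k], c[k][j])) for j in range(n)]
             for i in range(n)]
    return c

INF = math.inf
C0 = [
    [INF, 1,   5,   INF],
    [1,   INF, 2,   6],
    [5,   2,   INF, 3],
    [INF, 6,   3,   INF],
]
Cn = bottleneck_costs(C0)
print(Cn[0][2], Cn[0][3], Cn[1][3])
```

In the example, the direct edge between vertices 0 and 2 costs 5, but the path through vertex 1 has highest edge cost 2, so the computed cost drops below the edge cost.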

The  following  theorem  shows  that  the  unique  minimum-cost  spanning  tree  can  be  recovered 
from  the  costs  of  the  shortest  paths. 

Theorem 72 An edge {i, j} is in the unique minimum-cost spanning tree if and only if
C^0_{ij} = C^n_{ij}.

Proof: The proof has two parts. We first show that if {i, j} is a tree edge then C^0_{ij} = C^n_{ij}. We
then show that if C^0_{ij} = C^n_{ij} then the edge {i, j} is in the tree. First, assume that {i, j} is a
tree edge, but that C^0_{ij} ≠ C^n_{ij}. Consider the cut of the graph that {i, j} crosses, but no other
tree edge crosses. Since C^0_{ij} ≠ C^n_{ij}, there must be some path from i to j whose highest-cost
edge has cost C^n_{ij} < C^0_{ij}. Hence, every edge on this path has cost less than C^0_{ij}. This path must
cross the cut at least once. Replacing the edge {i, j} by any edge on the path that crosses the
cut reduces the cost of the tree, a contradiction. Conversely, assume that C^0_{ij} = C^n_{ij}, but that
{i, j} is not a tree edge. Adding the edge {i, j} to the tree forms a cycle. Since C^n_{ij} = C^0_{ij}, the
tree path from i to j contains an edge of cost at least C^0_{ij}, and because edge costs are unique
the highest-cost edge on the cycle costs more than C^0_{ij}. Replacing this edge by {i, j} yields a
tree with smaller cost, a contradiction. □

¹ For simplicity, we assume that the graph is connected. The same technique will find a minimum-cost spanning
forest of a disconnected graph.
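
Theorem 72 can be checked on small graphs by comparing the edges it selects with those found by a standard greedy algorithm. In the sketch below, Kruskal's algorithm is used only as an independent reference and is not part of the original text; vertices are numbered from 0, missing edges cost math.inf, and the function names are hypothetical.

```python
import math

def mst_by_path_finding(c0):
    """Edges {i, j} with C^0_ij = C^n_ij, as characterized by Theorem 72."""
    n = len(c0)
    c = [row[:] for row in c0]
    for k in range(n):                      # in-place form of the min-max recurrence
        for i in range(n):
            for j in range(n):
                c[i][j] = min(c[i][j], max(c[i][k], c[k][j]))
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if c0[i][j] < math.inf and c0[i][j] == c[i][j]}

def mst_by_kruskal(c0):
    """Reference answer: add edges in order of cost, skipping those that close a cycle."""
    n = len(c0)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    edges = sorted((c0[i][j], i, j) for i in range(n) for j in range(i + 1, n)
                   if c0[i][j] < math.inf)
    tree = set()
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

INF = math.inf
graph = [
    [INF, 1,   5,   INF],
    [1,   INF, 2,   6],
    [5,   2,   INF, 3],
    [INF, 6,   3,   INF],
]
print(sorted(mst_by_path_finding(graph)))
print(sorted(mst_by_kruskal(graph)))
```

Both functions rely on the edge costs being unique, as assumed in the text; with repeated costs the comparison of input and output matrices could select more edges than a single tree contains.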

4.3  Implementation  on  a  mesh-connected  computer 

In this section we give a short description of an O(n)-step algorithm for solving the
minimum-cost spanning tree problem on an n × n mesh-connected computer. We assume that the diagonal
element in each mesh row can broadcast a value to the other elements of the row in a single
step. This type of broadcast can be simulated by a mesh without this capability by slowing the
algorithm down by a constant factor [45, 59, 60]. The algorithm proceeds as follows. We assume
that the input graph is given in the form of a matrix of edge costs C^0 which enters row-by-row
through the top of the mesh. Matrix row i is modified as it passes over rows 1 through i − 1
and is stored when it reaches mesh row i. When matrix row i passes over mesh row k, the value
C^{k-1}_{ik} is broadcast right and left from the diagonal cell (k, k). Each cell (k, j), 1 ≤ j ≤ n, knows
the value of C^{k-1}_{kj} and computes

C^k_{ij} = min{ C^{k-1}_{ij}, max{ C^{k-1}_{ik}, C^{k-1}_{kj} } },

which is passed down to the next mesh row. After reaching mesh row i, matrix row i stays there
until each matrix row l, i < l ≤ n, above it has passed over it and then continues to propagate
down, passing over the rest of the matrix rows. The output matrix C^n exits row-by-row from
the bottom of the mesh. By Theorem 72, the adjacency matrix of the minimum-cost spanning
tree can be constructed by comparing the input and output matrices.


Directions for further research


Packet routing algorithms, distributed random-access machines, and network emulations are
the objects of ongoing research. This section presents some of the open questions and very
recent results in these areas.

Packet  routing 

Many challenging routing and sorting problems remain to be solved. As we mentioned in
Section 1.2, there is no efficient algorithm known for finding a schedule of length O(c + d) for a
set of packets whose paths have congestion c and dilation d. Also, there is no known algorithm
simpler than that of Section 1.5 for routing on an N-node butterfly in O(log N) steps using
constant-size queues. A simple FIFO queueing discipline performs well in simulations but has
eluded analysis.

Although Sections 1.9 and 3.6 provide randomized algorithms for sorting on the butterfly
and shuffle-exchange graphs in O(log N) steps using constant-size queues, there are no known
deterministic algorithms for routing or sorting on the butterfly or shuffle-exchange graphs in
O(log N) steps, even if large queues are allowed. Recently Cypher and Plaxton [21] discovered
a deterministic algorithm for sorting on the shuffle-exchange graph in O(log N (log log N)^2)
steps. Also, Upfal [95] recently found a deterministic algorithm for routing on a multibutterfly
network in O(log N) steps using constant-size queues. However, Upfal's algorithm does not
combine multiple packets with the same destination. The only known deterministic algorithm
for sorting N packets on an N-node bounded-degree network in O(log N) steps [47] is based on
the complicated AKS network [2].

Routing in the presence of faults has become an area of intense research. Typically it
is assumed that some of the edges or some of the nodes cannot transmit packets, and that
these failures are easily detected. It is also sometimes assumed that the faults are distributed
randomly throughout the network. Some of the recent results are summarized below.

In 1987, Håstad, Leighton, and Newman [31] presented a simple randomized on-line
algorithm for embedding an N-node hypercube in an N-node faulty hypercube with constant load,
congestion, and dilation. Faults are assumed to occur at the nodes randomly and independently
with some fixed probability p. As a consequence, the faulty hypercube can emulate a
fault-free hypercube with constant slowdown. Thus, it can route any permutation of N packets in
O(log N) time using constant-size queues on the edges. In 1989 they discovered an
O(log N)-step algorithm for routing directly on the faulty hypercube [32]. The algorithm is adaptive in
the sense that packets alter their paths to avoid faults.

Rabin [77] designed a fault-tolerant routing algorithm for the hypercube using
error-correcting codes. His idea is to break each message into smaller pieces and encode them so
that the original message can be recovered from any majority of them. In the course of
routing, pieces are lost if they attempt to use faulty edges, enter full queues, or fail to reach their
destinations quickly. Despite these losses, with high probability a majority of the pieces for
each message reach their destinations. The algorithm routes a permutation of N messages
in O(log N) time on an N-node hypercube using constant-size queues at the nodes. Edges
are  assumed  to  fail  randomly  and  independently  with  probability  1/log 7  N.  In  this  scheme, 
each message is broken into log N pieces. Attached to each piece is a Θ(log N)-bit ticket of
error-correcting information. Thus, for the scheme to be efficient, messages must be at least
Ω(log^2 N) bits long.

Raghavan [79] considered routing permutations on a faulty mesh. He showed that on a
√N × √N mesh where nodes fail randomly and independently with some fixed probability
p < .29, every packet that can reach its destination does so in O(√N log N) time. The algorithm
is randomized and uses queues of size O(log^2 N). Raghavan's result was improved by Karlin,
Leighton, Raghavan, and Thomborson, who showed that after sustaining k faults, a mesh can
route any permutation in min{√N + O(k^2), N} time.

In [52] we described an adaptive algorithm for routing on Upfal's multibutterfly [95] in the
presence of faults. We proved that an N-input multibutterfly can sustain k faults and still route
log N permutations between some set of N − O(k) inputs and N − O(k) outputs in O(log N)
time. The multibutterfly is even more resilient to randomized faults. A specially modified twin
butterfly can tolerate N^{3/4} faults at internal nodes, and still route any log N permutations
of N packets in O(log N) time. Before routing begins, faulty regions are spliced out of the
multibutterfly. Thereafter, the packets route as if there were no faults.

Distributed  random-access  machines 

To  date,  all  DRAM  algorithms  solve  graph  theoretic  problems.  It  is  natural  to  wonder  whether 
there  are  other  problem  domains  for  which  communication-efficient  algorithms  can  be  designed. 
One  difficulty  faced  in  designing  DRAM  algorithms  for  other  domains  is  the  lack  of  uniform- 
cost  shared  memory  in  the  model.  Unfortunately  we  haven’t  found  any  meaningful  way  to 
incorporate  PRAM-like  memory  into  the  model. 

Emulations 

Chapter 3 leaves open several challenging problems. For example, we do not know if there is a
real-time simulation of a complete ternary tree on a complete binary tree. Another unresolved
question  is  whether  there  is  a  class  of  bounded-degree  graphs  that  can  efficiently  emulate  the 
class  of  all  bounded-degree  graphs.  If  so,  the  graphs  in  this  universal  class  must  be  expanders. 

Schwabe recently resolved a long open question by proving that the butterfly and
shuffle-exchange graphs are computationally equivalent [85]. He showed that each network can perform
a real-time emulation of the other. The proof combines the techniques of embedding little
butterflies in a shuffle-exchange graph from Section 3.6 (and vice versa) with the overlap strategy
from Section 3.5.2 used by the butterfly to emulate the mesh.